

(19) Europäisches Patentamt
     European Patent Office
     Office européen des brevets

(11) EP 0 704 796 A2

(12) EUROPEAN PATENT APPLICATION

(43) Date of publication: 03.04.1996 Bulletin 1996/14

(51) Int. Cl.6: G06F 9/46

(21) Application number: 95304

(22) Date of filing: 16.06.1995

(84) Designated Contracting States: AT BE CH DE ES FR GB IT LI NL SE

(30) Priority: 28.09.1994 US 263313

(71) Applicant: International Business Machines Corporation
     Armonk, N.Y. 10504 (US)

(72) Inventors:
     • Magee, James Michael
       Lake Worth, Florida 33463 (US)
     • Sotomayor, Guy Gil, Jr.
       West Palm Beach, Florida 33415 (US)
     • Youngworth, Christopher Dean
       Savoy, Illinois 61874 (US)

(74) Representative: Williams, Julian David
     IBM United Kingdom Limited,
     Intellectual Property Department,
     Hursley Park
     Winchester, Hampshire SO21 2JN (GB)

(54) Capability engine method and apparatus for a microkernel data processing system






(57) A microkernel interprocess communications system and method provide fast and efficient
communication between clients and servers in uniprocessing, multiprocessing, and distributed
processing environments. A microkernel operating system includes a capability engine module that
manages capabilities or rights to map regions of the memory [...] tasks. There is a wide range of
port rights that can be attributed to a task port; various permission levels, security levels,
priority levels, processor and resource availability, etc. The capability engine analyses the
rights and selectively enables transfers between tasks. In this manner, the capability engine
manages the interprocess communication that must take place between the many clients and servers
in a Microkernel System, in a fast and efficient manner.

[FIG. 1: block diagram of the host multiprocessor 100. The memory 102 holds the dominant and
alternate personality applications, the dominant and alternate personality servers and services,
the personality-neutral services (multiple personality support, master server, initialization,
device support, default pager, network services, device drivers, database engines, security) and
the microkernel 120 (IPC 122, virtual memory 124, tasks and threads 126, host and processor
sets 128, I/O support and interrupts 130, machine dependent code). Also shown are the auxiliary
storage 106, the I/O adapter 108, processor A 110 and processor B 112.]

EP 0 704 796 A2 

Description 

The present invention broadly relates to data processing systems and more particularly relates to improvements in
operating systems for data processing systems.

The invention disclosed herein is related to the copending United States Patent Application by Guy G. Sotomayor,
Jr., James M. Magee, and Freeman L. Rawson, III, which is entitled "METHOD AND APPARATUS FOR MANAGEMENT
OF MAPPED AND UNMAPPED REGIONS OF MEMORY IN A MICROKERNEL DATA PROCESSING SYSTEM", Serial
Number 263,710, filed on the same day as the instant application, IBM Docket Number BC9-94-053, assigned to the
International Business Machines Corporation, and incorporated herein by reference.

The operating system is the most important software running on a computer. Every general purpose computer must
have an operating system to run other programs. Operating systems typically perform basic tasks, such as recognizing
input from the keyboard, sending output to the display screen, keeping track of files and directories on the disc, and
controlling peripheral devices such as disc drives and printers. For more complex systems, the operating system has
even greater responsibilities and powers. It makes sure that different programs and users running at the same time do
not interfere with each other. The operating system is also typically responsible for security, ensuring that unauthorized
users do not access the system.

Operating systems can be classified as multi-user operating systems, multi-processor operating systems, multi-tasking
operating systems, and real-time operating systems. A multi-user operating system allows two or more users to run
programs at the same time. Some operating systems permit hundreds or even thousands of concurrent users. A
multi-processing program allows a single user to run two or more programs at the same time. Each program being executed
is called a process. Most multi-processing systems support more than one user. A multi-tasking system allows a single
process to run more than one task. In common terminology, the terms multi-tasking and multi-processing are often used
interchangeably even though they have slightly different meanings. Multi-tasking is the ability to execute more than one
task at the same time, a task being a program. In multi-tasking, only one central processing unit is involved, but it switches
from one program to another so quickly that it gives the appearance of executing all of the programs at the same time.
There are two basic types of multi-tasking, preemptive and cooperative. In preemptive multi-tasking, the operating
system parcels out CPU time slices to each program. In cooperative multi-tasking, each program can control the CPU for
as long as it needs it. If a program is not using the CPU, however, it can allow another program to use it temporarily. For
example, the OS/2 (TM) and UNIX (TM) operating systems use preemptive multi-tasking, whereas the Multi-Finder (TM)
operating system for Macintosh (TM) computers uses cooperative multi-tasking. Multi-processing refers to a computer
system's ability to support more than one process or program at the same time. Multi-processing operating systems
enable several programs to run concurrently. Multi-processing systems are much more complicated than single-process
systems because the operating system must allocate resources to competing processes in a reasonable manner. A
real-time operating system responds to input instantaneously. General purpose operating systems such as DOS and
UNIX are not real-time.
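The cooperative case described above can be illustrated with a minimal sketch in Python. It is not taken from the application: the names `program` and `run_cooperative` are hypothetical, and a Python generator merely stands in for a program that gives up the CPU voluntarily. A preemptive system would instead interrupt each program when its time slice expires.

```python
from collections import deque

def program(name, steps, log):
    """A hypothetical program that yields the CPU voluntarily after each step."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # cooperative hand-off: nothing forces the program to stop here

def run_cooperative(programs):
    """Round-robin over the ready programs until every one has finished."""
    ready = deque(programs)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run until the program yields
            ready.append(current)  # it yielded, so queue it again
        except StopIteration:
            pass                   # the program finished; drop it

log = []
run_cooperative([program("A", 2, log), program("B", 3, log)])
```

Because each program controls how long it keeps the CPU, a program that never yields would starve the others, which is the weakness preemptive multi-tasking avoids.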

Operating systems provide a software platform on top of which application programs can run. The application
programs must be specifically written to run on top of a particular operating system. The choice of the operating system
therefore determines to a great extent the applications which can be run. For IBM compatible personal computers,
example operating systems are DOS, OS/2 (TM), AIX (TM), and XENIX (TM).

A user normally interacts with the operating system through a set of commands. For example, the DOS operating
system contains commands such as COPY and RENAME for copying files and changing the names of files, respectively.
The commands are accepted and executed by a part of the operating system called the command processor or command
line interpreter.

There are many different operating systems for personal computers such as CP/M (TM), DOS, OS/2 (TM), UNIX
(TM), XENIX (TM), and AIX (TM). CP/M was one of the first operating systems for small computers. CP/M was initially
used on a wide variety of personal computers, but it was eventually overshadowed by DOS. DOS runs on all IBM
compatible personal computers and is a single user, single tasking operating system. OS/2, a successor to DOS, is a
relatively powerful operating system that runs on IBM compatible personal computers that use the Intel 80286 or later
microprocessor. OS/2 is generally compatible with DOS but contains many additional features, for example it is
multi-tasking and supports virtual memory. UNIX and UNIX-based AIX run on a wide variety of personal computers and
workstations. UNIX and AIX have become standard operating systems for workstations and are powerful multi-user,
multi-processing operating systems.

In 1981 when the IBM personal computer was introduced in the United States, the DOS operating system occupied
approximately 10 kilobytes of storage. Since that time, personal computers have become much more complex and
require much larger operating systems. Today, for example, the OS/2 operating system for the IBM personal computers
can occupy as much as 22 megabytes of storage. Personal computers become ever more complex and powerful as
time goes by and it is apparent that the operating systems cannot continually increase in size and complexity without
imposing a significant storage penalty on the storage devices associated with those systems.




It was because of this untenable growth rate in operating system size that the MACH project was conducted at
Carnegie Mellon University in the 1980's. The goal of that research was to develop a new operating system that would
allow computer programmers to exploit emerging modern hardware architectures and yet reduce the size and the number
of features in the kernel operating system. The kernel is the part of an operating system that performs basic functions
such as allocating hardware resources. In the case of the MACH kernel, five programming abstractions were established
as the basic building blocks for the system. They were chosen as the minimum necessary to produce a useful system
on top of which the typical complex operations could be built externally to the kernel. The Carnegie Mellon MACH kernel
was reduced in size in its release 3.0, and is a fully functional operating system called the MACH microkernel. The
MACH microkernel has the following primitives: the task, the thread, the port, the message, and the memory object. The
task is the traditional UNIX process which is divided into two separate components in the MACH microkernel. The first
component is the task, which contains all of the resources for a group of cooperating entities. Examples of resources
in a task are virtual memory and communications ports. A task is a passive collection of resources; it does not run on
a processor.

The thread is the second component of the UNIX process, and is the active execution environment. Each task may
support one or more concurrently executing computations called threads. For example, a multi-threaded program may
use one thread to compute scientific calculations while another thread monitors the user interface. A MACH task may
have many threads of execution, all running simultaneously. Much of the power of the MACH programming model comes
from the fact that all threads in a task share the task's resources. For instance, they all have the same virtual memory
(VM) address space. However, each thread in a task has its own private execution state. This state consists of a set of
registers, such as general purpose registers [...]
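As a loose analogy only, the sharing described above can be sketched with Python's standard `threading` module. Python threads are not MACH threads, and the names `shared`, `lock` and `worker` are illustrative assumptions: the dictionary stands in for the task's common address space, while each thread's local variable stands in for its private execution state.

```python
import threading

shared = {"total": 0}      # stands in for memory shared by all threads of one task
lock = threading.Lock()    # serializes updates to the shared resource

def worker(n):
    local_sum = sum(range(n))  # private to this thread, like its own registers and stack
    with lock:
        shared["total"] += local_sum

threads = [threading.Thread(target=worker, args=(n,)) for n in (10, 100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both workers update the same `shared` object without any message passing, which is what sharing a task's resources makes possible.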

A port is the communications channel through which threads communicate with each other. A port is a resource
and is owned by a task. A thread gains access to a port by virtue of belonging to a task. Cooperating programs may
allow threads from one task to gain access to ports in another task. An important feature of ports is that they are
location transparent. This capability facilitates the distribution of services over a network without program modification.

The message is used to enable threads in different tasks to communicate with each other. A message contains
collections of data which are given classes or types. This data can range from program specific data such as numbers
or strings to MACH related data such as transferring capabilities of a port from one task to another.
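The port and message abstractions described above can be sketched as follows. This is an illustrative model under stated assumptions, not the MACH interface: the classes `Port` and `Message`, the `owner` attribute and the type tags are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from queue import Queue

@dataclass
class Message:
    """A collection of data items, each tagged with a class or type."""
    items: list  # list of (type_tag, value) pairs

class Port:
    """A communications channel owned by one task (hypothetical model)."""
    def __init__(self, owner):
        self.owner = owner      # the task that owns this port
        self._queue = Queue()   # messages waiting to be received

    def send(self, message):
        self._queue.put(message)

    def receive(self):
        return self._queue.get()

port = Port(owner="server_task")
port.send(Message(items=[("int", 42), ("string", "hello")]))
received = port.receive()
```

A sender only needs a reference to the port, not the location of the receiving task, which is the sense in which ports are location transparent.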

A memory object is an abstraction which supports the capability to perform traditional operating system functions
in user level programs, a key feature of the MACH microkernel. For example, the MACH microkernel supports virtual
memory paging policy in a user level program. Memory objects are an abstraction to support this capability.

All of these concepts are fundamental to the MACH microkernel programming model and are used in the kernel
itself. These concepts and other features of the Carnegie Mellon University MACH microkernel are described in the
book by Joseph Boykin, et al, "Programming Under MACH", Addison Wesley Publishing Company, Incorporated, 1993.
Additional discussions of the use of a microkernel to support a UNIX personality can be found in the article by Mike
Accetta, et al, "MACH: A New Kernel Foundation for UNIX Development", Proceedings of the Summer 1986 USENIX
Conference, Atlanta, Georgia. Another technical article on the topic is by David Golub, et al, "UNIX as an Application
Program", Proceedings of the Summer 1990 USENIX Conference, Anaheim, California.

The above cited, copending patent application by Guy G. Sotomayor, Jr., James M. Magee, and Freeman L. Rawson,
III, describes the Microkernel System 115 shown in Figure 1, which is a new foundation for operating systems. The
Microkernel System 115 provides a concise set of kernel services implemented as a pure kernel and an extensive set
of services for building operating system personalities implemented as a set of user-level servers. The Microkernel
System 115 is made up of many server components that provide the various traditional operating system functions and
that are manifested as operating system personalities. The Microkernel System 115 uses a client/server system structure
in which tasks (clients) access services by making requests of other tasks (servers) through messages sent over a
communication channel. Since the microkernel 120 provides very few services of its own (for example, it provides no
file service), a microkernel 120 task must communicate with many other tasks that provide the required services. This
raises the problem of how to manage the interprocess communication that must take place between the many clients
and servers in the system, in a fast and efficient manner.

In accordance with the present invention, there is now provided a method for interprocess communication in a
microkernel architecture data processing system, comprising the steps of: loading a microkernel including a capability
engine module into a memory of the data processing system; forming with said microkernel a first task container in said
memory having a set of attributes defining a first communication port and a first set of port rights, and having a pointer
to a memory object, said first set of port rights conferring a capability on said first task container to access said memory
object; forming with said microkernel a second task container in said memory having a set of attributes defining a second
communication port and a second set of port rights; registering in said capability engine, said first set of port rights for
said first task container and said second set of port rights for said second task container; comparing said first set of port
rights and said second set of port rights in said capability engine; and enabling a transfer with said capability engine, of
said pointer and said first port rights from said first task container to said second task container to confer onto said






second task container said capability to access said memory object. 
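The register, compare and transfer steps of the method can be sketched as below. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class names, the right names "send", "receive" and "map", and in particular the comparison policy (the transfer is allowed when the first task holds "send" and the second holds "receive") are hypothetical, since the text above does not specify how the two sets of rights are compared.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskContainer:
    name: str
    port_rights: frozenset                      # rights attached to the task's port
    memory_object: Optional[bytearray] = None   # stands in for the pointer to a memory object

class CapabilityEngine:
    def __init__(self):
        self._registry = {}  # task name -> registered port rights

    def register(self, task):
        self._registry[task.name] = task.port_rights

    def compare(self, first, second):
        # Hypothetical policy, assumed for illustration only.
        first_rights = self._registry[first.name]
        second_rights = self._registry[second.name]
        return "send" in first_rights and "receive" in second_rights

    def transfer(self, first, second):
        """Pass the pointer and the first task's rights to the second task if allowed."""
        if not self.compare(first, second):
            return False
        second.memory_object = first.memory_object  # the reference is shared, not copied
        second.port_rights = second.port_rights | first.port_rights
        self.register(second)                       # keep the registry up to date
        return True

engine = CapabilityEngine()
first = TaskContainer("first", frozenset({"send", "map"}), bytearray(b"data"))
second = TaskContainer("second", frozenset({"receive"}))
engine.register(first)
engine.register(second)
granted = engine.transfer(first, second)
```

After a successful transfer both containers refer to the same memory object, so the second task gains the capability to access it without the data being moved.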

Viewing the present invention from a second aspect, there is provided a subsystem for interprocess communication
in a microkernel architecture data processing system, comprising: a microkernel in a memory of a data processing
system, including a capability engine module; a first task container in said memory having a set of attributes defining a
first communication port and a first set of port rights, and having a pointer to a memory object, said first set of port rights
conferring a capability on said first task container to access said memory object; a second task container in said memory
having a set of attributes defining a second communication port and a second set of port rights; said capability engine
registering said first set of port rights for said first task container and said second set of port rights for said second task
container; said capability engine comparing said first set of port rights and said second set of port rights; said
capability engine enabling a transfer of said pointer and said first port rights from said first task container to said second
task container to confer onto said second task container said capability to access said memory object.

Viewing the present invention from a third aspect, there is provided a system for interprocess communication in a
microkernel architecture, comprising: a memory means in a data processing system, for storing data and programmed
instructions; an interprocess communications means in said memory means, for coordinating message passing between
tasks in said memory means; a first task in said memory means having a set of attributes defining a first communication
port and a first set of port rights, and having a pointer to a memory object, said first set of port rights conferring a capability
on said first task to access said memory object; a second task in said memory means having a set of attributes defining
a second communication port and a second set of port rights; a processor means coupled to said memory means, for
executing said programmed instructions; a first thread in said memory means associated with said first task, for providing
said programmed instructions for execution in said processor means; a capability engine means in said interprocess
communications means, for registering said first set of port rights for said first task and said second set of port rights for
said second task; said thread providing a message from said first task to said interprocess communications means, for
providing said pointer to said second task; said capability engine means comparing said first set of port rights and said
second set of port rights; and said capability engine means enabling a transfer of said pointer and said first port rights
from said first task to said second task to confer onto said second task said capability to access said memory object.

Viewing the present invention from a fourth aspect, there is provided a system for interprocess communication in
a microkernel architecture, comprising: a memory means in a data processing system, for storing data and programmed
instructions; a microkernel means in said memory means, for coordinating operations between a plurality of tasks in
said memory means; an interprocess communications means in said microkernel means, for coordinating message
passing between tasks in said memory means; a first task in said memory means having a set of attributes defining a
first communication port and a first set of port rights, and having a pointer to a memory object, said first set of port rights
conferring a capability on said first task to access said memory object; a second task in said memory means having a
set of attributes defining a second communication port and a second set of port rights; a processor means coupled to
said memory means, for executing said programmed instructions; a first thread in said memory means associated with
said first task, for providing said programmed instructions for execution in said processor means; a capability engine
means in said interprocess communications means, for registering said first set of port rights for said first task and said
second set of port rights for said second task; said thread providing a message from said first task to said interprocess
communications means, for providing said pointer to said second task; said capability engine means comparing said
first set of port rights and said second set of port rights; and said capability engine means enabling a transfer of said
pointer and said first port rights from said first task to said second task to confer onto said second task said capability
to access said memory object.

Viewing the present invention from a fifth aspect, there is provided a system for interprocess communication [...],
comprising: a memory means in a data processing system, for storing data and programmed instructions; a microkernel
means in said memory means, for coordinating operations between a plurality of tasks in said memory means; an
interprocess communications means in said microkernel means, for coordinating message passing between tasks in
said memory means; a first task in said memory means having a set of attributes defining a first communication port
and a first set of port rights, and having a pointer to a memory object, said first set of port rights conferring a capability
on said first task to access said memory object; a first processor means coupled to said memory means, for executing
said programmed instructions; a first thread in said memory means associated with said first task, for providing said
programmed instructions for execution in said first processor means; a second task in said
memory means having a set of attributes defining a second communication port and a second set of port rights; a second
processor means coupled to said memory means, for executing said programmed instructions; a second thread in said
memory means associated with said second task, for providing said programmed instructions for execution in said
second processor means; a capability engine means in said interprocess communications means, for registering said
first set of port rights for said first task and said second set of port rights for said second task; said first thread providing
a message from said first task to said interprocess communications means, for providing said pointer to said second
task; said capability engine means comparing said first set of port rights and said second set of port rights; and said
capability engine means enabling a transfer of said pointer and said first port rights from said first task to said second






task to confer onto said second task said capability to access said memory object.

Viewing the present invention from a sixth aspect, there is provided a system for interprocessor communication in
a distributed processor system, comprising: a memory means in a first host system of a distributed processor system,
for storing data and programmed instructions; a microkernel means in said memory means, for coordinating operations
between a plurality of tasks in said memory means; an interprocess communications means in said microkernel means,
for coordinating message passing between tasks in said memory means; a first task in said memory means having a
set of attributes defining a first communication port and a first set of port rights, and having a pointer to a memory object,
said first set of port rights conferring a capability on said first task to access said memory object; a first processor means
in said first host system, coupled to said memory means, for executing said programmed instructions; a first thread in
said memory means associated with said first task, for providing said programmed instructions for execution in said first
processor means; a second task in said memory means having a set of attributes defining a second communication
port and a second set of port rights; a second thread in said memory means associated with said second task, for
providing said programmed instructions for execution in said first processor means; a capability engine means in said
interprocess communications means, for registering said first set of port rights for said first task and said second set of
port rights for said second task; said first thread providing a message from said first task to said interprocess
communications means, for providing said pointer to said second task; said capability engine means comparing said first set
of port rights and said second set of port rights; and said capability engine means enabling a transfer of said pointer
and said first port rights from said first task to said second task to confer onto said second task said capability to access
said memory object; a communications link coupling said first processor in said first host system to a second host system
of said distributed processor system; a second processor means in said second host system, coupled to said first
processor means over said communications link; said second thread providing a reference to said pointer from said second
task to said communications link, for providing said reference to said pointer to said second processor means; said
second processor means receiving said reference to said pointer over said communications link from said first processor
means, to enable said second processor means to access said memory object in said first host system.

The present invention thus provides an improved microkernel architecture for a data processing system. Specifically,
the present invention provides an improved microkernel architecture for a data processing system that is more simplified
in its interprocess communication operations than has been possible in the prior art. The present invention thus provides
an improved microkernel architecture for a data processing system, that has a faster and more efficient interprocess
communication capability. Still furthermore, the present invention provides an improved microkernel architecture for a
data processing system, that has greater flexibility in the exchange of messages between tasks within a shared memory
environment and between distributed data processors that do not share a common memory.

A preferred embodiment of the present invention provides a method and subsystem for interprocess
communication in a microkernel architecture data processing system. The data processing system can be a shared
memory multiprocessing system or a uniprocessor system. A microkernel operating system is loaded into the memory
of the data processing system. In accordance with the invention, the microkernel includes a capability engine module
that manages capabilities or rights to map regions of the memory. A first task container is formed by the microkernel in
the memory, having a set of attributes defining a first communication port and a first set of port rights, and having a
pointer to a memory object, the first set of port rights conferring a capability on the first task container to access the
memory object. A second task container is formed in the memory having a set of attributes defining a second
communication port and a second set of port rights.

In a preferred embodiment of the present invention, the capability engine registers the first set of port rights for the
first task container and the second set of port rights for the second task container. The capability engine then compares
the first set of port rights and the second set of port rights to determine if the second task can be allowed to gain access
to the memory object. There is a wide range of port rights that can be attributed to a task port; various permission levels,
security levels, priority levels, processor and other resource availability, and many others, limited only by the user's
imagination. The capability engine analyses these rights and can selectively enable a transfer of the pointer and the
first port rights from the first task container to the second task container to confer onto the second task container the
capability to access the memory object. In this manner, the capability engine manages the interprocess communication
that must take place between the many clients and servers in the Microkernel System, in a fast and efficient manner.

The present invention applies to uniprocessors, shared memory multiprocessors, and multiple computers in a dis- 
tributed processor system. 

Preferred embodiments of the present invention will now be described with reference to the accompanying figures, 
in which: 

Figure 1 is a functional block diagram of the Microkernel System 115 in the memory 102 of the host multiprocessor
100, showing how the microkernel and personality-neutral services 140 run multiple operating system personalities
on a variety of hardware platforms;




Figure 2 shows the client visible structure associated with a thread;

Figure 3 shows the client [...];

Figure 4 shows a typical [...];

Figure 5 shows a series of port rights, contained in a port name space or in transit in a message;

Figure 6 [...];

Figure 7 shows a functional block diagram of the host multiprocessor system, with the IPC subsystem and the
capability engine managing interprocess communications between two tasks with threads running on two processors;

Figure 8 shows a functional block diagram of two host multiprocessor systems running in a distributed processing
arrangement, with the IPC subsystem and the capability engine on each host processor managing interprocess
communications between tasks with the exchange of messages between the two hosts over a communications link;

Figure 9 [...];

Figure 10 shows an example message transfer using the capability engine and the message passing library;

Figure 11 shows an example message transfer using the capability engine and the message passing library;

Figure 12 shows an example message transfer using the capability engine and the message passing library;

Figure 13 shows an example message transfer using the capability engine and the message passing library;

Figure 14 shows an example message transfer using the capability engine and the message passing library;

Figure 15 shows an example message transfer using the capability engine and the message passing library;

Figure 16 shows an example message transfer using the capability engine and the message passing library;

Figure 17 shows an example message transfer using the capability engine and the message passing library;

Figure 18 is [...];

Figure 19 illustrates a typical [...];

Figure 20 illustrates a message control structure;

Figure 21 [...];

Figure 22 [...];

Figure 23 illustrates an example of a non-trusted client/known message ID;

Figure 24 [...];

Figure 25 illustrates the overwrite buffer operation;

Figure 26 illustrates an RPC transfer;



Figure 28 illustrates the basic execution loop of the multiplexing server; 






Figure 29 is the message passing library anonymous reply algorithm;

Figure 30 illustrates share region initialization;

Figure 31 illustrates share region usage in the RPC common case.

Part A. The Microkernel System

Section 1. Microkernel Principles



Figure 1 is a functional block diagram of the Microkernel System 115, showing how the microkernel 120 and
personality-neutral services 140 run multiple operating system personalities 150 on a variety of hardware platforms.

The host multi-processor 100 shown in Figure 1 includes memory 102 connected by means of a bus 104, to an
auxiliary storage 106 which can be for example a disc drive, a read only or a read/write optical storage, or any other
bulk storage device. Also connected to the bus 104 is the I/O adaptor 108 which in turn may be connected to a keyboard,
a monitor display, a telecommunications adaptor, a local area network adaptor, a modem, multi-media interface devices,
or other I/O devices. Also connected to the bus 104 is a first processor A, 110 and a second processor B, 112. The
example shown in Figure 1 is of a symmetrical multi-processor configuration wherein the two uni-processors 110 and
112 share a common memory address space 102. Other configurations of single or multiple processors can be shown
as equally suitable examples. The processors can be, for example, an Intel 386 (TM) CPU, an Intel 486 (TM) CPU, a
Pentium (TM) processor, a Power PC (TM) processor, or other uni-processor devices.

The memory 102 includes the microkernel system 115 stored therein, which comprises the microkernel 120, the
personality neutral services (PNS) 140, and the personality servers 150. The microkernel system 115 serves as the
operating system for the application programs 180 stored in the memory 102.
An objective of the invention is to provide an operating system that behaves like a traditional operating system such
as UNIX or OS/2. In other words, the operating system will have the personality of OS/2 or UNIX, or some other traditional
operating system.

The microkernel 120 contains a small, message-passing nucleus of system software running in the most privileged
state of the host multi-processor 100, that controls the basic operation of the machine. The microkernel system 115
includes the microkernel 120 and a set of servers and device drivers that provide personality neutral services 140. As
the name implies, the personality neutral servers and device drivers are not dependent on any personality such as UNIX
or OS/2. They depend on the microkernel 120 and upon each other. The personality servers 150 use the message
passing services of the microkernel 120 to communicate with the personality neutral services 140. For example, UNIX,
OS/2 or any other personality server can send a message to a personality neutral disc driver and ask it to read a block
of data from the disc. The disc driver reads the block and returns it in a message. The message system is optimized so
that large amounts of data [...]

By virtue of its size and ability to support standard programming services and features as application programs, the
microkernel 120 is simpler than a standard operating system. The microkernel system 115 is broken down into modular
pieces that are configured in a variety of ways, permitting larger systems to be built by adding pieces to the smaller
ones. For example, each personality neutral server 140 is logically separate and can be configured [...].
Each server runs as an [...] a separate task and [...]

Figure 1 shows the microkernel 120 including the interprocess communications module (IPC) 122, the virtual memory
module 124, tasks and threads module 126, the host and processor sets 128, I/O support and interrupts 130, and
machine dependent code 125.

The personality neutral services 140 shown in Figure 1 includes the multiple personality support 142 which includes 
the master server, initialization, and naming. It also includes the default pager 144. It also includes the device support 
146 which includes multiple personality support and device drivers. It also includes other personality neutral products
148, including a file server, network services, database engines and security. 

The personality servers 150 are, for example, the dominant personality 152, which can be, for example, a UNIX
personality. It includes a dominant personality server 154, which would be a UNIX server, and other dominant personality
services 155, which would support the UNIX dominant personality. An alternate dominant personality 156 can be, for
example, OS/2. Included in the alternate personality 156 are the alternate personality server 158, which would characterize
the OS/2 personality, and other alternate personality [...]

Dominant personality applications 182 shown in Figure 1, associated with the UNIX dominant personality example,
are UNIX-type applications which would run on top of the UNIX operating system personality 152. The alternate
personality applications 186 shown in Figure 1 are OS/2 applications which run on top of the OS/2 alternate personality
operating system 155.






Figure 1 shows that the Microkernel System 115 carefully splits its implementation into code that is completely
portable from processor type to processor type and code that is dependent on the type of processor in the particular
machine on which it is executing. It also segregates the code that depends on devices into device drivers; however, the
device driver code, while device dependent, is not necessarily dependent on the processor architecture. Using multiple
threads per task, it provides an application environment that permits the use of multi-processors without requiring that
any particular machine be a multi-processor. On uni-processors, different threads run at different times. All of the support
needed for multiple processors is concentrated into the small and simple microkernel 120.

This section provides an overview of the structure of the Microkernel System 115. Later sections describe each 
component of the structure in detail and describe the technology necessary to build a new program using the services 
of the Microkernel System 115.

The Microkernel System 115 is a new foundation for operating systems. It provides a comprehensive environment
for operating system development with the following features:

Support for multiple personalities

Extensible memory management

Interprocess communication

Multi-threading

Multi-processing

The Microkernel System 115 provides a concise [...] set of services for building operating system
personalities [...] user-level servers. Objectives of the Microkernel System 115 include the following:

Permit multiple operating system personalities [...];

Provide common programming for low-level system elements, such as device drivers and file systems;

Exploit parallelism in both operating system and user applications;

Support large, potentially sparse address spaces [...];

Allow transparent network resource access;

Be compatible with existing software environments, such as OS/2 and UNIX; and

Portable (to 32-bit and 64-bit platforms).

The Microkernel System 115 is based on the following concepts:

User mode tasks performing many traditional operating system functions (for example, file system and network
access);

A basic set of user-level run time services for creating operating systems;

A simple, extensible communication kernel;

An object basis with communication channels as object references; and

A client/server programming model, using synchronous and asynchronous inter-process communication.

The basis for the Microkernel System 115 is to provide a simple, extensible communication kernel. It is an objective
of the Microkernel System 115 to permit the flexible configuration of services in either user or kernel space with the
minimum amount of function in the kernel proper. [...] communication, including:

Management of points of control (threads);

Resource assignment (tasks);

Support of address spaces [...]; and

Management of physical [...]

User mode tasks implement the policies regarding resource usage. The kernel simply provides mechanisms to
enforce those policies. Logically [...] provide
a C runtime environment, including [...]

Name Server - Allows a client to find a server

Master Server - Allows programs to be loaded and started

Kernel Abstractions

One goal of the Microkernel System 115 is to minimize abstractions provided by the kernel itself, but not to be
minimal in the semantics associated with those abstractions. Each of the abstractions provided has a set of semantics
associated with it, and a complex set of interactions with the other abstractions. This can make it difficult to identify key
ideas. The main kernel abstractions are:

Task - Unit of resource allocation, large access space [...]

Thread - Unit of CPU utilization, lightweight (low overhead)






Port - A communication channel, accessible only through the send/receive capabilities or rights 
Message - A collection of data objects 

Memory object - The internal unit of memory management (Refer to Section 2, Architectural Model, for a detailed 
description of the task, thread, port, message and memory object concepts). 

Tasks and Threads 

The Microkernel System 115 does not provide the traditional concept of process because: All operating system
environments have considerable semantics associated with a process (such as user ID, signal state, and so on). It is
not the purpose of the microkernel to understand or provide these extended semantics. Many systems equate a process
with an execution point of control. Some systems do not. The microkernel 120 supports multiple points of control
separately from the operating system environment's process. The microkernel provides the following two concepts:

Task

Thread

(Refer to Section 2, Architectural Model, for a detailed description of the task and thread concepts).
Memory Management 

The kernel provides some memory management. Memory is associated with tasks. Memory objects are the means
by which tasks take control over memory management. The Microkernel System 115 provides the mechanisms to support
large, potentially sparse virtual address spaces. Each task has an associated address map that is maintained by the
kernel and controls the translation of virtual addresses in the task's address space into physical addresses. As in virtual
memory systems, the contents of the entire address space of any given task might not be completely resident in physical
memory at the same time, and mechanisms must exist to use physical memory as a cache for the virtual address spaces
of tasks. Unlike traditional virtual memory designs, the Microkernel System 115 does not implement all of the caching
itself. It gives user mode tasks the ability to participate in these mechanisms. The PNS include a user task, the default
pager 144, that provides [...]

Unlike other resources in the Microkernel System 115, virtual memory is not referenced using ports. Memory can
be referenced only by using virtual addresses as indices into a particular task's address space. The memory and the
associated address map that defines a task's address space can be partially shared with other tasks. A task can allocate
new ranges of memory within its address space, de-allocate them, and change protections on them. It can also specify
inheritance properties for the ranges. A new task is created by specifying an existing task as a base from which to
construct the address space for the new task. The inheritance attribute of each range of the memory of the existing task
determines whether the new task has that range defined and whether that range is virtually copied or shared with the
existing task. Most virtual copy operations for memory are achieved through copy-on-write optimizations. A copy-on-write
optimization is accomplished by protected sharing. The two tasks share the memory to be copied, but with read-only
access. When either task attempts to modify a portion of the range, that portion is copied at that time. This lazy evaluation
of memory copies is an important performance optimization performed by the Microkernel System 115 and important to
the communication [...]
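
The copy-on-write behaviour described in this paragraph can be sketched in Python as follows. This is only an
illustration under stated assumptions, not the microkernel's implementation: the Page and AddressRange classes, the
reference count, and the one-page-per-range granularity are all invented for the example.

```python
class Page:
    """Hypothetical physical page with a count of the ranges that map it."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class AddressRange:
    """Hypothetical one-page range in a task's address space."""

    def __init__(self, page):
        self.page = page

    def virtual_copy(self):
        # No data is copied: both ranges map the same page, read-only in effect.
        self.page.refs += 1
        return AddressRange(self.page)

    def read(self, offset):
        return self.page.data[offset]

    def write(self, offset, value):
        # The first write to a shared page triggers the deferred copy.
        if self.page.refs > 1:
            self.page.refs -= 1
            self.page = Page(self.page.data)
        self.page.data[offset] = value


parent = AddressRange(Page(b"abcd"))
child = parent.virtual_copy()
shared_before = child.page is parent.page   # still one physical page
child.write(0, ord("z"))                    # copy happens here
shared_after = child.page is parent.page
```

The copy is paid for only by the task that writes, and only at the moment it writes, which is the lazy evaluation the
text refers to.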

Any given region of memory is backed by a memory object. A memory manager task provides the policy governing 
the relationship between the image of a set of pages while cached in memory (the physical memory contents of a memory 
region) and the image of that set of pages when not cached (the abstract memory object). The PNS has a default memory
manager or pager that provides basic non-persistent memory objects that are zero-filled initially and paged against
system paging space.
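
A minimal Python sketch of a pager that behaves as this paragraph describes is given below. The DefaultPager class, its
page_in and page_out methods, the dictionary standing in for paging space, and the 4096-byte page size are assumptions
made for illustration; they are not interfaces defined by this document.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class DefaultPager:
    """Hypothetical pager backing non-persistent, initially zero-filled memory."""

    def __init__(self):
        # (memory object id, page index) -> page contents written out earlier
        self._paging_space = {}

    def page_in(self, memory_object, index):
        # A page never written out is supplied zero-filled.
        return self._paging_space.get((memory_object, index), bytes(PAGE_SIZE))

    def page_out(self, memory_object, index, data):
        # Evicted pages are kept in paging space until requested again.
        self._paging_space[(memory_object, index)] = bytes(data)


pager = DefaultPager()
fresh = pager.page_in("obj", 0)
pager.page_out("obj", 0, b"\x01" * PAGE_SIZE)
restored = pager.page_in("obj", 0)
```

The kernel holds the cached image of the pages; the pager holds only the image of pages that are not currently cached.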

Task to Task Communication

The Microkernel System 115 uses a client/server system structure in which tasks (clients) access services by making
requests of other tasks (servers) through messages sent over a communication channel. Since the microkernel 120
provides very few services of its own (for example, it provides no file service), a microkernel 120 task must communicate
with many other tasks that provide the required services. The communication channels of the interprocess communication
(IPC) mechanism are called ports. (Refer to Section 2, Architectural Model, for a detailed description of a port.)
A message is a collection of data, memory regions, and port rights. A port right is a name by which a task that holds
the right names the port. A task can manipulate a port only if it holds the appropriate port rights. Only one task can hold
the receive right for a port. This task is allowed to receive (read) messages from the port queue. Multiple tasks can hold
send rights to the port that allow them to send (write) messages into the queue. A task communicates with another task
by building a data structure that contains a set of data elements, and then performing a message-send operation on a
port for which it holds a send right. At some later time, the task holding the receive right to that port performs a
message-receive operation.

Note: This message transfer is an asynchronous operation. The message is logically copied into the receiving task
(possibly with copy-on-write optimizations). Multiple threads within the receiving task can be attempting to receive
messages from a given port, but only one thread will receive any given message.
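
The port model described above (one receiver, many senders, a queue between them) can be sketched in Python. The
Port class and its method names are hypothetical and task identities are plain strings; the sketch shows only the
rights check and the asynchronous queueing, not the kernel's message format.

```python
from collections import deque


class Port:
    """Hypothetical port: one message queue, one receiver, many senders."""

    def __init__(self, receiver):
        self.receiver = receiver   # the single task holding the receive right
        self.senders = set()       # tasks holding send rights
        self.queue = deque()       # messages waiting to be received

    def grant_send(self, task):
        self.senders.add(task)

    def send(self, task, message):
        if task not in self.senders:
            raise PermissionError(f"{task} holds no send right")
        self.queue.append(message)  # asynchronous: the sender does not wait

    def receive(self, task):
        if task != self.receiver:
            raise PermissionError(f"{task} holds no receive right")
        return self.queue.popleft() if self.queue else None


port = Port(receiver="file_server")
port.grant_send("client_a")
port.grant_send("client_b")
port.send("client_a", {"op": "read", "block": 7})
port.send("client_b", {"op": "read", "block": 9})
first = port.receive("file_server")
second = port.receive("file_server")
```

Each queued message is handed out exactly once, which mirrors the rule that only one thread receives any given message.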

Section 2. Architectural Model

The Microkernel System 115 has, as its primary responsibility, the provision of points of control that execute
instructions within a framework. These points of control are called threads. Threads execute in a virtual environment. The
virtual environment provided by the kernel contains a virtual processor that executes all of the user space accessible
hardware instructions, augmented by emulated instructions (system traps) provided by the kernel.
The virtual processor accesses a set of virtualized registers and some virtual memory that otherwise responds as does
the machine's physical memory. All other hardware resources are accessible only through special combinations of memory
accesses and emulated instructions. Note that all resources provided by the kernel are virtualized. This section
describes the top level elements of the virtual environment seen by threads.



Elements of the Personality Neutral Services (PNS) 

The PNS 140 portion of the Microkernel System 115 consists of services built on the underlying microkernel 120.
This provides some functions that the kernel itself [...] as well as a basic set of user-level services for the
construction of pro[...] can serve requests from [...] clients and
are used to construct the operating system personalities themselves. In addition, there is an ANSI C run time environment
for the construction [...] and some supplem[...] that have definitions taken f[...]
the POSIX standard. Besides the libraries that [...] themselves, there are many libraries that exist within the
PNS that are a [...] represent the interfaces [...]
the support [...] 115 interprocess
communications facilities. The structure of the PNS environment library hides the details of the implementation of
each service from its callers. Some libraries, such as one of the C run time libraries, implement all of their functions as
local routines that [...] the address space of the caller, while other libraries consist of stubs that invoke the
microkernel's IPC [...] send messages [...]. This architecture permits the flexible [...] of function:
servers can be replaced by other servers and services can be combined into single tasks without affecting the sources
of the programs [...]
system. Instead [...]. The dominant personality 152, that is [...]
during system [...], is the operating system [...] user interface on the system and provides
services to its clients and to elements of the PNS. [...] is a server of last resort. The dominant
personality [...] services are defined by [...] but are not implemented by another server.

The microkernel 120 is also dependent on some elements of the PNS. There are cases when it sends messages
to personality [...] operations. For example, in resolving a page fault, the
microkernel 120 may send a message to the default pager [...] that the kernel
needs from a hard disk. Although the page fault is usually being resolved on behalf of a user task, the kernel is the
sender of the message.

The PNS run time provides a set of ANSI C and POSIX libraries that are used to support a standard C programming
environment for programs executing in this environment. The facilities include typical C language constructs. Like all
systems, the microkernel system 115 has, as its primary responsibility, the provision of points of control that execute
instructions within a framework. In the microkernel 120, points of control are called threads. Threads execute in a virtual
environment. The virtual environment provided by the microkernel 120 consists of a virtual processor that executes all
of the user space accessible hardware instructions, augmented by emulated instructions (system traps) provided by the
kernel. The virtual processor accesses a set of virtualized registers and some virtual memory that otherwise responds
as does the machine's physical memory. All other hardware resources are accessible only through special combinations
of memory accesses and emulated instructions. Note that all resources provided by the microkernel are virtualized. This
section describes the top level elements of the virtual environment seen by the microkernel threads.



Elements of the Kernel

The microkernel 120 provides an environment consisting of the elements described in the following list of Kernel
Elements:



Thread: An execution point of control. A thread is a lightweight entity. Most of the state pertinent to a thread is
associated with its containing task.

Task: A container to hold references to resources in the form of a port name space, a virtual address space, and a
set of threads.

Security Token: A security feature passed from the task to server, which performs access validations.

Port: A unidirectional communication channel between tasks.

Port Set: A set of ports which can be treated as a single unit when receiving a message.

Port Right: Allows specific rights to access a port.

Port Name Space: An indexed collection of port names that names a particular port right.

Message: A collection of data, memory regions and port rights passed between two tasks.

Message Queue: A queue of messages associated with a single port.

Virtual Address Space: A sparsely populated, indexed set of memory pages that can be referenced by the threads within
a task. Ranges of pages might have arbitrary attributes and semantics associated with them through mechanisms
implemented by the kernel and external memory managers.

Abstract Memory Object: An abstract object that represents the non-resident state of the memory ranges backed by this
object. The task that implements this object is called a memory manager. The abstract memory object port is the port
through which the kernel requests action of the memory manager.

Memory Object Representative: The abstract representation of a memory object provided by the memory manager to
clients of the memory object. The representative names the associated abstract memory object and limits the potential
access modes permitted to the client.

Memory Cache Object: A kernel object that contains the resident state of the memory ranges backed by an abstract
memory object. It is through this object that the memory manager manipulates the clients' [...]

Processor: A physical processor [...]

Processor Set: A set of processors, each of which can be used to execute the threads assigned to the processor set.

Host: The multiprocessor as a whole.

Clock: A representation of the passage of time. A time value incremented at a constant frequency.



Many of these elements are kernel implemented resources that can be directly manipulated by threads. Each of
these elements is discussed in detail in the paragraphs that follow. However, since some of their definitions depend
on the definitions of others, some of the key concepts are discussed in simplified form so that a full discussion can be
understood.






Threads 

A thread is a lightweight entity. It is inexpensive to create and requires low overhead to operate. A thread has little
state (mostly its register state). Its owning task bears the burden of resource management. On a multiprocessor it is
possible for multiple threads in a task to execute in parallel. Even when parallelism is not the goal, multiple threads have
an advantage because each thread can use a synchronous programming style, instead of asynchronous programming
with a single thread attempting to provide multiple services. A thread contains the following features:

1. a point of control flow in a task or a stream of instruction execution;

2. access to all of the elements of the containing task;

3. executes in parallel with other threads, even threads within the same task; and

4. minimal state for low overhead.

A thread is the basic computational entity. A thread belongs to only one task that defines its virtual address space.
To affect the structure of the address space, or to reference any resource other than the address space, the thread must
execute a special trap instruction. This causes the kernel to perform operations on behalf of the thread, or to send a
message to an agent on behalf of the thread. These traps manipulate resources associated with the task containing the
thread. Requests can be made of the kernel to manipulate these entities: to create and delete them and affect their
state. The kernel is a manager that provides resources (such as those listed above) and services. Tasks may also provide
services, and implement abstract resources. The kernel provides communication methods that allow a client task to
request that a server task [...] service. In this way, a task has a dual identity.
One identity is that of a resource managed by the kernel, whose resource manager executes within the kernel. The
second identity is that of a supplier of resources for which the resource manager is the task itself.

A thread has the following state:

1. Its machine state (registers, etc.), which changes as the thread executes and which can also be changed by a
holder of the kernel thread port;

2. A small set of thread specific port rights, identifying the thread's kernel port and ports used to send exception
messages on behalf of the thread;

3. A suspend count, non-zero if the thread is not to execute instructions; and

4. Resource scheduling parameters.

A thread operates by executing instructions in the usual way. Various special instructions trap to the kernel, to
perform operations on behalf of the thread. The most important of these kernel traps is the mach_msg_trap. This trap
allows the thread to send messages to the kernel and other servers to operate upon resources. This trap is almost never
directly called; it is invoked through the mach_msg [...]. Exceptional conditions, such as "floating point
overflow" and "page not resident", that arise during the thread's execution are handled by sending messages to a port. The
port used depends on the [...]. The outcome of the exceptional condition is determined by setting
the thread's state and/or responding to the exception message. The following operations can be performed on a thread:

Creation and destruction;

Suspension and resumption (manipulating the suspend count);

Machine state manipulation;

Special port (such as exception port) manipulation; and

Resource (scheduling) control.
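
Suspension and resumption through a suspend count can be sketched in Python as follows. The Thread class and its
method names are hypothetical; the sketch assumes that each suspend increments the count, each resume decrements it,
and the thread may execute only while the count is zero, as the state list above implies.

```python
class Thread:
    """Hypothetical thread carrying only the suspend count from its state."""

    def __init__(self):
        self.suspend_count = 0

    def suspend(self):
        self.suspend_count += 1

    def resume(self):
        if self.suspend_count == 0:
            raise RuntimeError("thread is not suspended")
        self.suspend_count -= 1

    def runnable(self):
        # Non-zero count means the thread is not to execute instructions.
        return self.suspend_count == 0


thread = Thread()
thread.suspend()
thread.suspend()                     # two independent suspend requests
thread.resume()
still_suspended = not thread.runnable()
thread.resume()
running_again = thread.runnable()
```

Counting, rather than a single flag, lets independent parties suspend the same thread without one resume undoing
another party's suspension.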


Tasks 

A task is a collection of system resources. These resources, with the exception of the address space, are referenced
by ports. These resources can be shared with other tasks if rights to the ports are so distributed. Tasks provide a large,
potentially sparse address space, referenced by machine address. Portions of this space can be shared through
inheritance or external memory management. Note: A task has no life of its own. It contains threads which execute instructions.
When it is said "a task Y does X" what is meant is "a thread contained within task Y does X". A task is an expensive
entity. All of the threads in a task share everything. Two tasks share nothing without explicit action, although the action
is often simple. Some resources such as port receive rights cannot be shared between two tasks. A task can be viewed
as a container that holds a set of threads. It contains default values to be applied to its contained threads. Most
importantly, it contains those elements that its contained threads need to execute, namely, a port name space and a virtual
address space.

The state associated with a task is as follows:

The set of contained threads;

The associated virtual address space;

The associated port name space, naming a set of port rights, and a related set of port notification requests;

A security token [...];

A small set of task specific ports, identifying the task's kernel port, default ports to use for exception handling for
contained threads, and bootstrap ports to name other services;

A suspend count, non-zero if no contained threads are to execute instructions;

Default scheduling parameters for threads; and

Various statistics, including statistical PC samples.

Tasks are created by specifying a prototype task which specifies the host on which the new task is created, and
which can supply by inheritance various portions of its address space.
The following operations can be performed on a task: 

Creation and destruction 

Setting the security token 

Suspension and resumption 

Special port manipulation 

Manipulation of contained threads 

Manipulation of the scheduling parameters
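
Creating a task from a prototype, with each address range inherited according to its own attribute, can be sketched in
Python. The create_task function, the dictionary layout, and the labels "none", "copy" and "share" are assumptions
made for illustration; an eager list copy stands in for the virtual (copy-on-write) copy the text describes.

```python
def create_task(prototype):
    """Build a new address space from a prototype task's ranges.

    `prototype` maps a range name to (inheritance, contents), where
    inheritance is one of the hypothetical labels "none", "copy", "share".
    """
    child = {}
    for name, (inheritance, contents) in prototype.items():
        if inheritance == "share":
            child[name] = (inheritance, contents)         # same underlying memory
        elif inheritance == "copy":
            child[name] = (inheritance, list(contents))   # stand-in for a virtual copy
        # "none": the range is left undefined in the new task
    return child


parent = {
    "code": ("share", [1, 2, 3]),
    "heap": ("copy", [10, 20]),
    "scratch": ("none", [99]),
}
child = create_task(parent)
child["code"][1][0] = 7     # visible to the parent: the range is shared
child["heap"][1][0] = 70    # private to the child: the range was copied
```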

Security Port 

All tasks are tagged with a security token, an identifier that is opaque from the kernel's point of view. It encodes the
identity and other security attributes of the task. This security token is included as an implicit value in all messages sent
by the task. Trusted servers can use this sent token as an indication of the sender's identity for use in making access
mediation decisions. A task inherits the security token of its parent. Because this token is to be used as an unforgeable
indication of identity, privilege is required to change this token. This privilege is indicated by presenting the host security
port.

A reserved value indicates the kernel's identity. All messages from the kernel carry the kernel identity, except
exception messages, which carry the e[...]
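
The security token rules above (implicit in every message, inherited from the parent, changeable only with privilege)
can be sketched in Python. The Task class, the HOST_SECURITY_PORT sentinel, the message dictionary, and the
server_allows helper are all hypothetical names introduced for this illustration.

```python
HOST_SECURITY_PORT = object()  # stands in for the privileged host security port


class Task:
    """Hypothetical task that stamps its security token on every message."""

    def __init__(self, token=None, parent=None):
        # A task inherits the security token of its parent.
        self.token = parent.token if parent is not None else token

    def set_token(self, new_token, privilege):
        # Changing the token requires presenting the host security port.
        if privilege is not HOST_SECURITY_PORT:
            raise PermissionError("host security port required")
        self.token = new_token

    def send(self, body):
        # The token travels implicitly; the sender cannot choose its value here.
        return {"sender_token": self.token, "body": body}


def server_allows(message, allowed_tokens):
    """A trusted server's access check, keyed on the sender's token."""
    return message["sender_token"] in allowed_tokens


shell = Task(token="user:alice")
editor = Task(parent=shell)
request = editor.send("open /notes")
allowed = server_allows(request, {"user:alice"})
```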

Port 

A port is a unidirectional communication channel between a client that requests a service and a server that provides
the service. A port has a single receiver and potentially multiple senders. The state associated with a port is as follows:

Its associated message queue [...]

A count of references [...]

Kernel services [...] ports. All system entities other than virtual memory ranges are named by ports;
ports are also created [...] these entities are created. The kernel provides notification messages upon the
death of a port upon request. With the exception of the task's virtual address space, all other system resources are
accessed through a level of indirection known as a port. A port is a unidirectional communication channel between a
client who requests service and a server who provides the service. If a reply is to be provided to such a service request,
a second port must be used. The service to be provided is determined by the manager that receives the message sent
over the port. It follows that the receiver for ports associated with kernel provided entities is the kernel. The receiver for
ports associated with task provided entities is the task providing that entity. For ports that name task provided entities,
it is possible to change the receiver of messages for that port to a different task. A single task might have multiple ports
that refer to resources it supports. Any given entity can have multiple ports that represent it, each implying different sets
of permissible operations. For example, many entities have a name port and a control port that is sometimes called the
privileged port. Access to the control port allows the entity to be manipulated. Access to the name port simply names
the entity, for example, to return information. There is no system-wide name space for ports. A thread can access only
the ports known to its containing task. A task holds a set of port rights, each of which names a (not necessarily distinct)
port and which specifies the rights permitted for that port. Port rights can be transmitted in messages. This is how a task
gets port rights. A port right is named with a port name, which is an integer chosen by [...]
within the context (port name space) of the task holding that right. Most operations in the system consist of sending a
message to a port that names a manager for the object being manipulated. In this document, this is shown in the form:

object -> function

which means that the function is invoked (by sending an appropriate message) to a port that names the object. Since
a message must be sent to a port (right), this operation has an object basis. Some operations require two objects, such
as binding a thread to a processor set. These operations show the objects separated by commas. Not all entities are
named by ports, and this is not a pure object model. The two main non-port-right named entities are port names/rights
themselves, and ranges of memory. Event objects are also named by task local IDs. To manipulate a memory range, a
message is sent to the containing virtual address space named by the owning task. To manipulate a port name/right,
and often, the associated port, a message is sent to the containing port name space named by the owning task. A
subscript notation,

object [id] -> function

is used here to show that an id is required as a parameter in the message to indicate which range or element of object
is to be manipulated. The parenthetic notation,

object (port) -> function

is used here to show that a privileged port, such as the host control port, is required as a parameter in the message to
indicate sufficient privilege to manipulate the object in the particular way.

Port Sets 

A port set is a set of ports that can be treated as a single unit when receiving a message. A mach_msg receive
operation is allowed against a port name that either names a receive right, or a port set. A port set contains a collection
of receive rights. When a receive operation is performed against a port set, a message is received from one of the ports
in the set. The received message indicates from which member port it was received. It is not allowed to directly receive
a message from a port that is a member of a port set. There is no concept of priority for the ports in a port set; there is
no control provided over the kernel's choice of the port within the port set from which any given message is received.
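
Receiving from a port set can be sketched in Python as follows. The PortSet class and its methods are hypothetical.
The sketch returns the member name together with the message, since the text says the received message identifies
its member port; the scan order used to pick a member is an arbitrary choice of this example and, as in the text, the
caller gets no control over it.

```python
from collections import deque


class PortSet:
    """Hypothetical port set: several receive rights read as one unit."""

    def __init__(self):
        self._members = {}  # member port name -> queue of pending messages

    def add_member(self, name):
        self._members[name] = deque()

    def enqueue(self, name, message):
        # Stands in for a sender delivering a message to a member port.
        self._members[name].append(message)

    def receive(self):
        # Returns (member name, message) from some member with a pending
        # message, or None. Callers must not rely on which member is chosen.
        for name, queue in self._members.items():
            if queue:
                return name, queue.popleft()
        return None


port_set = PortSet()
port_set.add_member("requests")
port_set.add_member("notifications")
port_set.enqueue("notifications", "port died")
received = port_set.receive()
```

A server built this way can wait on all of its ports with a single receive operation instead of dedicating a thread to
each port.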

Operations supported for port sets include:

Creation and deletion

Membership [...]

Port Rights 

A port can only be accessed by using a port right. A port right allows access to a specific port in a specific way.

There are three types of port rights as follows:

receive right - Allows the holder to receive messages from the associated port.
send right - Allows the holder to send messages to the associated port.
send-once right - Allows the holder to send a single message to the associated port. The port right self-destructs after the message is sent.

Port rights can be copied and moved between tasks using various options in the mach_msg call, and also by explicit command. Other than message operations, port rights can be manipulated only as members of a port name space. Port rights are created implicitly when any other system entity is created, and explicitly using explicit port creation. The kernel will, upon request, provide notification to a port of one's choosing when there are no more send rights to a port. Also, the destruction of a send-once right (other than by using it to send a message) generates a send-once notification sent to the corresponding port. Upon request, the kernel provides notification of the destruction of a receive right.

Port Name Space 

Ports and port rights do not have system-wide names that allow arbitrary ports or rights to be manipulated directly. Ports can be manipulated only through port rights, and port rights can be manipulated only when they are contained within a port name space. A port right is specified by a port name which is an index into a port name space. Each task has associated with it a single port name space. An entry in a port name space can have the following four possible values:

MACH_PORT_NULL - No associated port right.

MACH_PORT_DEAD - A right was associated with this name, but the port to which the right referred has been destroyed.

A port right - A send-once, send or receive right for a port.

A port set name - A name which acts like a receive right, but that allows receiving from multiple ports.

Acquiring a new right in a task generates a new port name. As port rights are manipulated by referring to their port names, the port names are sometimes themselves manipulated. All send and receive rights to a given port in a given port name space have the same port name. Each send-once right to a given port has a different port name from any other and from the port name used for any send or receive rights held. Operations supported for port names include the following:

Creation (implicit in creation of a right) and deletion 

Query of the associated type 

Rename 

Upon request, the kernel provides notification of a name becoming unusable.
Since port name spaces are bound to tasks, they are created and destroyed with their owning task. 

Message 

A message is a collection of data, memory regions and port rights passed between two entities. A message is not a system object in its own right. However, since messages are queued, they are significant because they can hold state between the time a message is sent and received. This state consists of the following:

Pure data
Copies of memory ranges
Port rights
Sender's security token

Message Queues

A port consists of a queue of messages. This queue is manipulated only through message operations (mach_msg) that transmit messages. The state associated with a queue is the ordered set of messages queued, and a settable limit on the number of messages.

Virtual Address Space 

A virtual address space defines the set of valid virtual addresses that a thread executing within the task owning the virtual address space is allowed to reference. A virtual address space is named by its owning task. A virtual address space consists of a sparsely populated indexed set of pages. The attributes of individual pages can be set as desired. For efficiency, the kernel groups virtually contiguous sets of pages that have the same attributes into internal memory regions. The kernel is free to split or merge memory regions as desired. System mechanisms are sensitive to the identities of memory regions, but most user accesses are not so affected, and can span memory regions freely. A given memory range can have distinct semantics associated with it through the actions of a memory manager. When a new memory range is established in a virtual address space, an abstract memory object is specified, possibly by default, that represents the semantics of the memory range, by being associated with a task (a memory manager) that provides those semantics. A virtual address space is created when a task is created, and destroyed when the task is destroyed. The initial contents of the address space are determined from various options to the task_create call, as well as the inheritance properties of the memory ranges of the prototype task used in that call. Most operations upon a virtual address space name a memory range within the address space. These operations include the following:
Creating or allocating a range
Copying a range
Setting special attributes, including "wiring" the page into physical memory
Setting memory protection attributes
Setting inheritance properties
Directly reading and writing ranges
Forcing a range flush to backing storage
Reserving a range (preventing random allocation within the range)

Abstract Memory Object

The microkernel allows user mode tasks to provide the semantics associated with referencing portions of a virtual address space. It does this by allowing the specification of an abstract memory object that represents the non-resident state of the memory ranges backed by this memory object. The task that implements this memory object and responds to messages sent to the port that names the memory object is called a memory manager. The kernel should be viewed as using main memory as a directly accessible cache for the contents of the various memory objects. The kernel is involved in an asynchronous dialog with the various memory managers to maintain this cache, filling and flushing this cache as the kernel desires, by sending messages to the abstract memory object ports. The operations upon abstract memory objects include the following:

Initialization
Page reads
Page writes
Synchronization with force and flush operations
Requests for permission to access pages
Page copies
Termination
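
The cache relationship described above can be illustrated with a minimal sketch. It is not the microkernel's implementation: the class and method names (MemoryManager, KernelCache, page_read, page_write) are hypothetical, and the dialog is modeled as synchronous calls, whereas the text describes it as asynchronous messages sent to the abstract memory object port.

```python
class MemoryManager:
    """Hypothetical user-mode task that backs one abstract memory object."""

    def __init__(self):
        self.backing = {}   # page index -> bytes held in backing storage
        self.reads = 0      # number of page-read requests served

    def page_read(self, index):
        # Stand-in for a "page read" request from the kernel.
        self.reads += 1
        return self.backing.get(index, b"\x00")

    def page_write(self, index, data):
        # Stand-in for a "page write" request from the kernel.
        self.backing[index] = data


class KernelCache:
    """Main memory treated as a cache for the contents of a memory object."""

    def __init__(self, manager):
        self.manager = manager
        self.resident = {}  # page index -> (data, dirty flag)

    def read(self, index):
        if index not in self.resident:
            # Cache miss: fill the page from the memory manager.
            self.resident[index] = (self.manager.page_read(index), False)
        return self.resident[index][0]

    def write(self, index, data):
        self.resident[index] = (data, True)

    def flush(self):
        # Write dirty pages back to the manager, then evict everything.
        for index, (data, dirty) in self.resident.items():
            if dirty:
                self.manager.page_write(index, data)
        self.resident.clear()
```

In this sketch a second read of a resident page never reaches the manager, which is the sense in which main memory acts as a cache.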

Memory Object Representative 

The abstract memory object port is used by the kernel to request access to the backing storage for a memory object. Because of the protected nature of this dialog, memory managers do not typically give access to the abstract memory object port to clients. Instead, clients are given access to memory object representatives. A memory object representative is the client's representation of a memory object. There is only one operation permitted against such a port and that is to map the associated memory object into a task's address space. Making such a request initiates a protocol between the mapping kernel and the memory manager to initialize the underlying abstract memory object. It is through this special protocol that the kernel is informed of the abstract memory object represented by the representative, as well as the set of access modes permitted by the representative.

Memory Cache Object 

The portion of the kernel's main memory cache that contains the resident pages associated with a given abstract memory object is referred to as the memory cache object. The memory manager for a memory object holds send rights to the kernel's memory cache object. The memory manager is involved in an asynchronous dialog with the kernel to provide the abstraction of its abstract memory object by sending messages to the associated memory cache object. The operations upon memory cache objects include the following:

Set operational attributes 

Return attributes 

Supply pages to the kernel

Indicate that pages requested by the kernel [illegible]

Indicate that pages requested by the kernel should be filled by the kernel's default rules

Force delayed copies of the object to be completed

Indicate that pages sent to the memory [illegible]

Restrict access to memory pages

Provide performance hints

Terminate

Processor

Each physical processor that is capable of executing threads is named by a processor control port. Although significant in that they perform the real work, processors are not very significant in the microkernel, other than as members of a processor set. It is a processor set that forms the basis for the pool of processors used to schedule a set of threads, and that has scheduling attributes associated with it. The operations supported for processors include the following:

Assignment to a processor set 

Machine control, such as start and stop 

Processor Set 

Processors are grouped into processor sets. A processor set forms a pool of processors used to schedule the 
threads assigned to that processor set. A processor set exists as a basis to uniformly control the schedulability of a set of threads. The concept also provides a way to perform coarse allocation of processors to given activities in the system. The operations supported upon processor sets include the following:

Creation and deletion
Assignment of processors
Assignment of threads and tasks
Scheduling control

Host 

Each machine (uniprocessor or multiprocessor) in a networked microkernel system runs its own instantiation of the microkernel. The host multiprocessor 100 is not generally manipulated by client tasks. But, since each host does carry its own microkernel 120, each with its own port space, physical memory and other resources, the executing host is visible and sometimes manipulated directly. Also, each host generates its own statistics. Hosts are named by a name port which is freely distributed and which can be used to obtain information about the host and a control port which is closely held and which can be used to manipulate the host. Operations supported by hosts include the following:

Clock manipulation
Statistics gathering
Re-boot
Setting the default memory manager
Obtaining lists of processors and processor sets

Clock

A clock provides a representation of the passage of time by incrementing a time value counter at a constant frequency. Each host or node in a multicomputer implements its own set of clocks based upon the various clocks and timers supported by the hardware as well as abstract clocks built upon these timers. The set of clocks implemented by a given system is set at configuration time. Each clock is named by both a name and a control or privileged port. The control port allows the time and resolution of the clock to be set. Given the name port, a task can perform the following:

Determine the time and resolution of the clock.

Generate a memory object that maps the time value. 

Sleep (delay) until a given time. 

Request a notification or alarm at a given time. 
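
As a rough illustration of a clock that increments a counter at a constant frequency, the sketch below derives the time and resolution from the tick count. The class name Clock, the nanosecond units and the method names are assumptions made for this example only; the text does not specify them.

```python
class Clock:
    """Hypothetical clock: a counter incremented at a constant frequency."""

    NS_PER_SECOND = 1_000_000_000

    def __init__(self, frequency_hz):
        self.frequency_hz = frequency_hz
        self.ticks = 0

    def tick(self, count=1):
        # Stand-in for the hardware timer advancing the counter.
        self.ticks += count

    def resolution_ns(self):
        # Smallest time step the clock can represent.
        return self.NS_PER_SECOND // self.frequency_hz

    def time_ns(self):
        return self.ticks * self.NS_PER_SECOND // self.frequency_hz

    def ticks_until(self, deadline_ns):
        # Ticks a sleeper must wait before the deadline is reached
        # (rounded up), or 0 if the deadline has already passed.
        target = -(-deadline_ns * self.frequency_hz // self.NS_PER_SECOND)
        return max(0, target - self.ticks)
```

A sleep or alarm request can then be expressed as waiting for ticks_until(deadline) further increments.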

Section 3. Tasks and Threads

This section discusses the user visible view of threads and tasks. Threads are the active entities in the Microkernel System 115. They act as points of control within a task, which provides them with a virtual address space and a port name space with which other resources are accessed.

Threads

A thread is the basic computational entity. A thread belongs to only one task that defines its virtual address space. A thread is a lightweight entity with a minimum of state. A thread executes in the way the hardware does, fetching instructions from its task's address space based on the thread's register values. The only actions a thread can take directly are to execute instructions that manipulate its registers and read and write into its memory space. An attempt to execute privileged machine instructions, though, causes an exception. The exception is discussed later. To affect the structure of the address space, or to reference any resource other than the address space, the thread must execute a special trap instruction which causes the kernel to perform operations on behalf of the thread, or to send a message to some agent on behalf of the thread. Also, faults or other illegal instruction behavior cause the kernel to invoke its exception processing.

Figure 2 shows the client visible structure associated with a thread. The thread object is the receiver for messages sent to the kernel thread port. Aside from any random task that holds a send right for this thread port, the thread port is also accessible as the thread's thread self port, through the containing processor set or the containing task.

Reference is made here to the above cited copending United States Patent Application by Guy G. Sotomayor, Jr., James M. Magee, and Freeman L. Rawson, III, entitled "METHOD AND APPARATUS FOR MANAGEMENT OF MAPPED AND UNMAPPED REGIONS OF MEMORY IN A MICROKERNEL DATA PROCESSING SYSTEM", which is incorporated herein by reference for its more detailed discussion of this topic.

Tasks

A task can be viewed as a container that holds a set of threads. It contains default values to be applied to its containing threads. Most importantly, it contains those elements that its containing threads need to execute, namely, a port name space and a virtual address space.

Figure 3 shows the client visible task structures. The task object is the receiver for messages sent to the kernel task port. Aside from any random task that may hold a send right to the task port, the task port can be derived from the task's task self port, the contained threads or the containing processor set.

Reference is made here to the above cited copending United States Patent Application by Guy G. Sotomayor, Jr., James M. Magee, and Freeman L. Rawson, III, entitled "METHOD AND APPARATUS FOR MANAGEMENT OF MAPPED AND UNMAPPED REGIONS OF MEMORY IN A MICROKERNEL DATA PROCESSING SYSTEM", which is incorporated herein by reference for its more detailed discussion of this topic.

Section 4. IPC

With the exception of its shared memory, a microkernel task interacts with its environment purely by sending messages and receiving replies. These messages are sent using ports. A port is a communication channel that has a single receiver and can have multiple senders. A task holds rights to these ports that specify its ability to send or receive messages.



Ports 



A port is a unidirectional communication channel between a client who requests a service and a server who provides the service. A port has a single receiver and can have multiple senders. A port that represents a kernel supported resource has the kernel as the receiver. A port that names a service provided by a task has that task as the port's receiver. This receivership can change if desired, as discussed under port rights. The state associated with a port is:

The associated message queue
A count of references or rights to the port
Port right and out-of-line memory receive limits
Message sequence number
Number of send rights created from receive right
Containing port set
Name of no-more-sender port if specified

Figure 4 shows a typical port, illustrating a series of send rights and the single receive right. The associated message queue has a series of ordered messages. One of the messages is shown in detail, showing its destination port, reply port reference, a send-and-receive right being passed in the message, as well as some out-of-line or virtual copy memory.

Few operations affect the port itself. Most operations affect port rights or a port name space containing those rights, or affect the message queue. Ports are created implicitly when any other system entity is created. Also, mach_reply_port creates a port. Ports are created explicitly by port_name_space [port_name] -> mach_port_allocate and port_name_space [port_name] -> mach_port_allocate_name. A port cannot be explicitly destroyed. It is destroyed only when the receive right is destroyed.

The attributes of a port are assigned at the time of creation. Some of these attributes, such as the limit on the number of port rights or amount of out-of-line memory that can be received in a message, can be changed with port_name_space [port_name] -> mach_port_set_attributes. These attributes can be obtained with port_name_space [port_name] -> mach_port_get_attributes.

The existence of a port is of obvious importance; tasks using a port may wish to be notified, through a message, when the port dies. Such notifications are requested with an option to mach_msg (MACH_RCV_NOTIFY), as well as with port_name_space [port_name] -> mach_port_request_notification. The resultant dead name notification indicates that a task's port name has gone dead because of the destruction of the named port. The message indicates the task's name for the now dead port. (This is discussed under port name spaces.)

Messages 

A message is a collection of data, out-of-line memory regions, and port rights passed between two entities. A message is not a manipulable system object in its own right. However, because messages are queued, they are significant because they can hold state between the time a message is sent and the time it is received. Besides pure data, a message can also contain port rights. This is significant. In this way a task obtains new rights, by receiving them in a message. A message consists of an Interprocess Communication (IPC) subsystem parsed control section and a data section. In addition, the message may point to regions of data to be transferred which lie outside the message proper. These regions may contain port rights (out-of-line port arrays). A message also carries the sender's security token. The control section of a message is made up of a header and optionally a message body and one or more message descriptors. The header specifies a port name for the port to which the message is sent and an auxiliary port name for the port to which a reply is to be sent if a reply is requested.

The message body, when present, follows the header and declares the number of descriptors to follow. If there are no descriptors, the message is not considered "complex", meaning no overt translation or copying of data is required of the IPC subsystem. Non "complex" messages do not contain a message body. Each descriptor describes a section of kernel manipulated data such as an out-of-line memory region, port right, and port right array. The data contained in the message data section is treated as an anonymous array of bytes by the IPC subsystem. Both the sender and receiver of the message must share a common understanding of the data format. For messages that originate in a Message Interface Generator (MIG) generated routine, the first eight bytes of data contain machine encoding information for possible intermachine conversion of data contained in the message.
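
The layout just described (a header, an optional body that counts descriptors, and an untyped data section) can be pictured with the following sketch. The class and field names are hypothetical and do not correspond to the actual kernel structures; the sketch only captures the rule that a message is "complex" exactly when it carries descriptors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Header:
    dest_port_name: int                    # port the message is sent to
    reply_port_name: Optional[int] = None  # auxiliary port for a reply, if any


@dataclass
class Descriptor:
    kind: str      # assumed labels: "port", "ool_memory" or "ool_ports"
    value: object  # the right, region or array being described


@dataclass
class Message:
    header: Header
    descriptors: List[Descriptor] = field(default_factory=list)
    data: bytes = b""                      # anonymous bytes to the IPC layer

    @property
    def is_complex(self):
        # A message is "complex" only if it carries descriptors.
        return len(self.descriptors) > 0

    def body_descriptor_count(self):
        # The body exists only for complex messages; it declares the
        # number of descriptors that follow. None models "no body".
        return len(self.descriptors) if self.is_complex else None
```

For example, Message(Header(7), data=b"hello") has no body, while adding a single Descriptor("port", 12) makes the message complex with a body count of 1.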

A descriptor for a single port right names the right as well as any special transformations performed by mach_msg, such as moving the right instead of making a copy or generating a right from a right of a different type. A descriptor for an "out-of-line" port array also specifies the IPC processing for the set of rights, which must all be of the same type, but specifies the address and size of an array of port names. An out-of-line data descriptor specifies the size and address of the out-of-line region. The region need not start at the beginning of a page nor contain an integral number of pages. Only the specified data is logically transmitted. The descriptor can specify that the act of queuing the message will de-allocate the memory range, as if by vm_deallocate, from the sending task. The sender is given the choice of sending a physical or virtual copy of the data. Requesting a virtual copy permits the kernel to use its virtual copy, that is, copy-on-write, mechanisms to efficiently copy large amounts of data. As a result, the receiver may see a virtual copy that is backed by a sender memory manager. Both sender and receiver might experience indeterminacy in access time to the memory because of the potential virtual copy mechanisms used. The choice of a physical copy guarantees deterministic access by the sender and receiver to the data. Neither the values in the out-of-line data regions nor those in the message data are typed. Only the receiver of a message, with the knowledge of the port over which it came and the message data, can interpret and possibly transform these data areas.

Message trailers 

When a message is received, a trailer is appended to the end by default. The trailer is generally added by the IPC subsystem and is physically contiguous with the message. The trailer contains various transmission-related fields, such as message sequence number and sender security token. The format and size of the trailer are controllable via receive side options to mach_msg. It is also possible as an option for the sender to supply a trailer.

Message Queues 

A port consists of a queue of messages. The queue is manipulated only through message operations (mach_msg) that transmit messages. The only controllable state for a message queue is its size. This can be set when port_name_space [port_name] -> mach_port_set_attributes is given the receive right for the associated port. If a message queue is full, no more messages can be queued. The callers will block. Messages sent to a port are delivered reliably. Reception of a message guarantees that all previous messages were received and that they were received in the order in which they were queued on the receive port.
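
A bounded, ordered queue of this kind can be sketched as follows. The names are hypothetical, and blocking is modeled only by a False return value at the point where a real sender would wait.

```python
from collections import deque


class MessageQueue:
    """Hypothetical model of a port's message queue with a size limit."""

    def __init__(self, limit):
        self.limit = limit        # the only controllable state: queue size
        self.messages = deque()   # ordered set of queued messages

    def try_send(self, message):
        # Returns False where a real caller would block on a full queue.
        if len(self.messages) >= self.limit:
            return False
        self.messages.append(message)
        return True

    def receive(self):
        # Messages leave in exactly the order in which they were queued.
        return self.messages.popleft()
```

With a limit of 2, a third try_send fails until a receive frees a slot, and the receiver always sees the messages in the order they were queued.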

Port Rights

A port right is an entity that indicates the right to access a specific port in a specific way. A port can only be accessed through a port right. In this context, there are three types of port rights as follows:

receive right - Allows the holder to receive messages from the associated port.
send right - Allows the holder to send messages to the associated port.
send-once right - Allows the holder to send a single message to the associated port. The right self-destructs after the message is sent.

Port rights are a secure, location independent way of identifying ports. These rights are kernel protected entities. Clients manipulate port rights only through port names they have to these rights.

Basic Manipulation 

mach_msg is one of the principal ways that rights are manipulated. Port rights can be moved between tasks, that is, deleted from the sender and added to the receiver in messages. Option flags in a message can cause mach_msg to make a copy of an existing send right, or to generate a send or a send-once right from a receive right. Rights can also be forcefully copied or moved by port_name_space [port_name] -> mach_port_extract_right (the equivalent of the target sending the right in a message) and port_name_space [port_name] -> mach_port_insert_right (the equivalent of the target receiving the right in a message). Besides message operations, port rights can be manipulated only as members of a port name space.

Figure 5 shows a series of port rights, contained in a port name space or in transit in a message. A port set is also shown in the port name space. Port rights are created implicitly when any other system entity is created. Mach_reply_port creates a port right. Port rights are created explicitly by port_name_space [port_name] -> mach_port_allocate and port_name_space [port_name] -> mach_port_allocate_name. A port right is destroyed by port_name_space [port_name] -> mach_port_deallocate and port_name_space [port_name] -> mach_port_destroy. Destruction can also be a by-product of port name space manipulations, such as by port_name_space [port_name] -> mach_port_mod_refs. Given a receive right, some status information can be obtained with port_name_space [port_name] -> mach_port_get_attributes.

No-More-Senders Notification

The system maintains a system-wide count of the number of send and send-once rights for each port. This includes rights in transit in messages, including the destination and reply port rights. The receiver of a port may wish to know if there are no more send rights for the port, indicating that the port may no longer have value. A notification of this form can be requested using port_name_space [port_name] -> mach_port_request_notification. This notification depends on the concept of a make-send count, discussed as a part of port name spaces. The movement to another task of the receive right cancels the outstanding no-more-senders notification request, and sends a send-once notification to indicate this cancellation. No-more-senders notification occurs when the number of existing send rights goes to zero without regards to the number of outstanding send-once rights.
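
The counting rule for this notification can be sketched as below. The class Port, its method names and the use of a Python callback in place of a notification port are all assumptions for illustration; treating the request as one-shot is also an assumption of this sketch.

```python
class Port:
    """Hypothetical per-port right counts kept by the system."""

    def __init__(self):
        self.send_rights = 0
        self.send_once_rights = 0
        self.make_send_count = 0
        self._notify = None   # callback standing in for a notification port

    def make_send(self):
        # A send right is made from the receive right.
        self.send_rights += 1
        self.make_send_count += 1

    def make_send_once(self):
        self.send_once_rights += 1

    def request_no_senders(self, callback):
        self._notify = callback

    def release_send(self):
        self.send_rights -= 1
        # Only send rights matter; outstanding send-once rights are ignored.
        if self.send_rights == 0 and self._notify is not None:
            callback, self._notify = self._notify, None
            callback(self.make_send_count)
```

In this sketch the callback fires when the last send right is released, even while a send-once right is still outstanding, and it receives the make-send count.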

Send-Once Rights

A send-once right allows a single message to be sent. These rights are generated only from the receive right. A send-once right has the property that it guarantees that a message will result from it. In the normal case, a send-once right is consumed by using it as the destination port in a message. The right is silently destroyed when the message is received. The send-once right can be moved from task to task when it is not being used as a destination right, until such time as it is consumed. If the right is destroyed in any way other than by using it to send a message, a send-once notification is sent to the port. Most of the ways in which a send-once right can be destroyed, when it is not used, are fairly obvious. There are two obscure cases:

The send-once right was specified as the target for a no-senders notification and the port for which the no-senders notification was requested is deleted or the receive right moved. Since there will be no forthcoming no-senders notification, a send-once notification is generated instead.

In the process of performing a message receive, the task gives away its receive right after the message is de-queued from the port but prior to its being returned to the task (refer to the details of message transmission below). A send-once notification is sent to the destination port signifying the lost association between the message sent via the send-once right and the port.
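
The guarantee that every send-once right results in exactly one message can be illustrated with the sketch below. The class name and the list used as the destination port's queue are hypothetical stand-ins.

```python
class SendOnceRight:
    """Hypothetical send-once right bound to one destination queue."""

    def __init__(self, inbox):
        self.inbox = inbox   # list standing in for the port's message queue
        self.live = True

    def send(self, payload):
        # Normal case: the right is consumed by sending its one message.
        if not self.live:
            raise ValueError("send-once right already consumed")
        self.live = False
        self.inbox.append(("message", payload))

    def destroy(self):
        # Destroyed without being used: a send-once notification is
        # delivered instead, so the receiver still gets exactly one item.
        if self.live:
            self.live = False
            self.inbox.append(("send-once notification", None))
```

Whether the holder uses the right or discards it, the receiver sees one entry per right, which is why a task can keep track of outstanding send-once rights.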

Port Name Space 

Ports are not named, but port rights are. A port right can only be named by being contained within a port name space. A port is specified by a port name which is an index into a port name space. Each task is associated with a port name space.

An entry in a port name space can have four possible values:

MACH_PORT_NULL - No associated port right.

MACH_PORT_DEAD - A right was associated with this name, but the port to which the right referred is now dead. The port name is kept in this state until explicit action is taken to avoid reusing this name before the client task understands what happened.

A port right - A send-once, send or receive right for a port.

A port set name - A name which acts like a receive right, but that allows receiving from multiple ports. This is discussed in the next section.

Each distinct right that a port name space contains does not necessarily have a distinct name in the port name 
space. Send-once rights always consume a separate name for each distinct right. Receive and send rights to the same
port coalesce. That is, if a port name space holds three send rights for a port, it will have a single name for all three 
rights. A port name has an associated reference count for each type of right such as send-once, send, receive, port set, 
and dead name, that is associated with the name. If the port name space also holds the receive right, that receive right 
will have the same name as the send right. 
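
The coalescing rule can be sketched as follows. The class PortNameSpace, the integer names and the string labels for right types are assumptions for this example; the kernel's actual name allocation is not described here.

```python
import itertools


class PortNameSpace:
    """Hypothetical per-task table mapping port names to rights."""

    def __init__(self):
        self._next_name = itertools.count(1)
        self.shared = {}  # port id -> name shared by send and receive rights
        self.refs = {}    # name -> {right type: reference count}

    def acquire(self, port, right):
        if right == "send-once":
            # Each send-once right consumes a separate name.
            name = next(self._next_name)
        else:
            # Send and receive rights to the same port coalesce.
            if port not in self.shared:
                self.shared[port] = next(self._next_name)
            name = self.shared[port]
        counts = self.refs.setdefault(name, {})
        counts[right] = counts.get(right, 0) + 1
        return name
```

Acquiring three send rights to one port yields a single name with a send reference count of 3, a receive right to the same port reuses that name, and two send-once rights each get a fresh name.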

A name becomes dead when its associated port is destroyed. It follows that a task holding a dead name cannot be holding a receive right under that name as well. The dead name only has a non-zero reference count for the number of send references previously held by that name. A task can be notified with a message sent to it, when one of its names becomes dead using port_name_space [port_name] -> mach_port_request_notification. Receiving this notification message increments the reference count for the dead name, to avoid a race with any threads manipulating the name.

Whenever a task acquires a right, it is assigned a port name subject to the above rules. Acquiring a right increments the name's reference count for the type of that right. The reference count can be obtained with port_name_space [port_name] -> mach_port_get_refs.

Although a port name can be explicitly destroyed and all references removed using port_name_space [port_name] -> mach_port_destroy, port names are typically manipulated by modifying the reference count. port_name_space [port_name] -> mach_port_mod_refs modifies the reference count for a specified right type associated with a name. port_name_space [port_name] -> mach_port_deallocate is similar to mach_port_mod_refs, but it always decrements the count by 1, and it will only decrement the send or send-once reference count. This routine is useful for manipulating the reference count for a port name that may have become dead since the decision was made to modify the name. Options to mach_msg that actually move a right and also port_name_space [port_name] -> mach_port_extract_right can cause the name's reference count to be decremented. Port names are freed when all the reference counts go to zero. If a port name is freed and a dead-name notification is in effect for the name, a port-deleted notification is generated. A name with a dead-name notification request in effect can be in only one of three states:

Naming a valid right

MACH_PORT_DEAD, with a dead-name notification having been sent when the name became dead

MACH_PORT_NULL, with a port-deleted notification having been sent

Information about a name, such as the type of the name, can be obtained with port_name_space [port_name] -> mach_port_type. The list of assigned names is obtained with port_name_space [port_name] -> mach_port_names. The name by which a right is known can be changed with port_name_space [port_name] -> mach_port_rename. Given a receive right name, some status information can be obtained with port_name_space [port_name] -> mach_port_get_attributes.

Receive rights have an associated make-send count, used for no-more-senders notification processing. The make-send count is the kernel's count of the number of times a send right was made from the receive right (with a port right descriptor specifying a MACH_MSG_TYPE_MAKE_SEND type in a message sent via mach_msg). This make-send count is set to zero when a port is created, and reset to zero whenever the receive right is transmitted in a message. It can also be changed with port_name_space [port_name] -> mach_port_set_mscount. The make-send count is included in the no-more-senders notification message. Note that a no-senders notification indicates the lack of send rights at the time the notification was generated. There may still be outstanding send-once rights. A task can easily keep track of the send-once rights since every send-once right guarantees a message or send-once notification. Received messages are stamped with a sequence number, taken from the port from which the message was received. Messages received from a port set are stamped with a sequence number from the appropriate member port. Sequence numbers placed into sent messages are overwritten. Newly created ports start with a zero sequence number, and the sequence number is reset to zero whenever the port's receive right is moved. It can also be set explicitly with port_name_space [port_name] -> mach_port_set_seqno. When a message is de-queued from the port, it is stamped with the port's sequence number and the port's sequence number is then incremented. The de-queue and increment operations are atomic, so that multiple threads receiving messages from a port can use the msgh_seqno field to reconstruct the original order of the messages.

Since port name spaces are bound to tasks, they are created and destroyed with their owning task.
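
Why an atomic de-queue and increment lets several receiving threads rebuild the queue order can be seen in the sketch below. The class SequencedPort, the helper drain_in_parallel and the use of a lock are assumptions made for illustration, not the kernel's mechanism.

```python
import threading
from collections import deque


class SequencedPort:
    """Hypothetical port that stamps each received message with a seqno."""

    def __init__(self):
        self._lock = threading.Lock()
        self._queue = deque()
        self._seqno = 0   # newly created ports start at zero

    def send(self, payload):
        with self._lock:
            self._queue.append(payload)

    def receive(self):
        # De-queue and increment happen under one lock, so they are atomic.
        with self._lock:
            payload = self._queue.popleft()
            seqno = self._seqno
            self._seqno += 1
        return seqno, payload


def drain_in_parallel(port, workers=4):
    """Receive everything with several threads, then restore queue order."""
    received = []

    def worker():
        while True:
            try:
                received.append(port.receive())
            except IndexError:   # queue is empty
                return

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # Sorting by sequence number recovers the original order.
    return [payload for _, payload in sorted(received)]
```

The threads may finish their receives in any order, but sorting on the stamped sequence number yields the messages in the order they were queued.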



30 



35 



40 



45 



55 



Port Sets 



A port set is a set of ports that can be treated as a single unit when receiving a message. A mach_msg receive operation is allowed against a port name that either names a receive right, or a port set. A port set contains a collection of receive rights. When a receive operation is performed against a port set, a message will be received at random from one of the ports in the set. The order of delivery of messages on a port set is indeterminate and subject to implementation with the following two caveats:

1 ) No member of a port set shall suffer resource starvation. 

2) The order of arrival of a message with respect to other messages on the same port is preserved. 



Each of the receive rights in the set has its own name, and the set has its own name. A receive against a port set reports the name of the receive right whose port provided the message. A receive right can belong to only one port set. A task may not directly receive from a receive right that is in a port set. A port set is created with port_name_space [port_name]-> mach_port_allocate or port_name_space [port_name]-> mach_port_allocate_name. It is destroyed by port_name_space [port_name]-> mach_port_destroy or port_name_space [port_name]-> mach_port_deallocate. Manipulation of port sets is done with port_name_space [port_name]-> mach_port_move_member. This call can add a member to a set, remove it from a set, or move it from one set to another. The membership of a port set can be found with port_name_space [port_name]-> mach_port_get_set_status.
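The two caveats on port set delivery (no starvation, per-port ordering preserved) can be sketched as follows. This is an illustrative Python model only: the `PortSet` class and its round-robin rotation are assumptions chosen to satisfy the caveats, and the text itself leaves the actual delivery order to the implementation.

```python
from collections import deque

class PortSet:
    """Toy port set with round-robin service across members (hypothetical policy)."""
    def __init__(self):
        self._members = {}        # receive-right name -> FIFO message queue
        self._order = deque()     # rotation order, used to avoid starvation

    def add_member(self, name):
        self._members[name] = deque()
        self._order.append(name)

    def send(self, name, body):
        self._members[name].append(body)

    def receive(self):
        # Visit members in rotation; report which receive right supplied the message.
        for _ in range(len(self._order)):
            name = self._order[0]
            self._order.rotate(-1)
            if self._members[name]:
                return name, self._members[name].popleft()
        return None               # nothing queued on any member

pset = PortSet()
pset.add_member("a")
pset.add_member("b")
for i in range(3):
    pset.send("a", f"a{i}")
pset.send("b", "b0")

delivered = [pset.receive() for _ in range(4)]
```

Member "b" is served after a single message from "a" even though "a" has a backlog, and messages from "a" still arrive in the order a0, a1, a2.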

Message Transmission 

The mach_msg system call sends and receives microkernel messages. For a message to actually be transferred between two tasks, the sender must make a mach_msg call with the send option and the proper target port. The receiver must make a mach_msg call with the receive option, undertaking a receive on the target port. The order of the two calls is unimportant.

The send operation queues a message to a port. The message carries a copy of the caller's data. The caller can freely modify the message buffer and out-of-line regions after the send operation returns without affecting the data sent. Data specified as out-of-line in the message is passed as a virtual copy or a physical copy depending on sender options and the kernel's choice of mechanisms. The kernel constructs a virtual memory image of the set of pages that define the region. "Out-of-line" port arrays are physically copied and translated to port names appropriate to the destination space. If the kernel constructs a virtual copy, it zeroes the portion of the first page preceding the data in the virtual copy and the portion of the last page following the data in the virtual copy. The caller blocks until the message can be queued, unless one of the following happens:

The message is being sent to a send-once right. These messages always forcibly queue.

The mach_msg operation is aborted (thread_abort). By default, the mach_msg library routine retries operations that are interrupted.

The send operation exceeds its time-out value.

The port is destroyed.

Sending a message is a two-step process. The first step involves constructing a kernel copy of the message. The second step involves the queuing of the message. Failures during the first step, such as invalid port rights or data addresses, cause the message send to fail with an error return, but no ill effect. Failures during the second step may also occur when the send time-out value is exceeded or an interrupt (thread_abort) occurs. These failures also cause the send to fail, but in these situations the kernel tries to return the message contents to the caller with a pseudo-receive operation. This pseudo-receive operation prevents the loss of port rights or memory that only exist in the message (for example, a receive right that was moved into the message, or out-of-line memory sent with the de-allocate option).

The pseudo-receive operation is very similar to a normal receive operation. The pseudo-receive handles the port rights in the message header as if they were in the message body. After the pseudo-receive, the message is ready to be resent. Notice that if the message is not resent, out-of-line memory ranges may have moved and some port rights may have changed names.
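The two-step send and the pseudo-receive can be modeled in a few lines. The following Python sketch is illustrative only: `Port`, `send`, `SendError`, and the use of a full queue as a stand-in for a time-out are all assumptions, and a real sender would block rather than return immediately.

```python
class SendError(Exception):
    """Raised when step one (building the kernel copy) fails."""

class Port:
    def __init__(self, limit):
        self.queue = []
        self.limit = limit            # a full queue stands in for a send time-out

def send(port, body, moved_rights, sender_rights):
    # Step 1: build the kernel copy. A failure here leaves the sender untouched.
    for right in moved_rights:
        if right not in sender_rights:
            raise SendError(f"invalid port right: {right}")
    kernel_copy = {"body": body, "rights": list(moved_rights)}
    for right in moved_rights:
        sender_rights.remove(right)   # moved rights now exist only in the message

    # Step 2: queue the kernel copy.
    if len(port.queue) >= port.limit:
        # Pseudo-receive: return the contents to the caller so nothing is lost.
        sender_rights.update(kernel_copy["rights"])
        return "pseudo-received"
    port.queue.append(kernel_copy)
    return "queued"

rights = {"reply_port"}
port = Port(limit=1)
first = send(port, "hello", [], rights)
second = send(port, "world", ["reply_port"], rights)   # queue is already full
try:
    send(port, "oops", ["no_such_right"], rights)
    step_one_failed = False
except SendError:
    step_one_failed = True
```

After the failed second send, the moved right is back in the sender's set, which is the property the pseudo-receive exists to guarantee.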

The receive operation de-queues a message from a port. The receiving task acquires the port rights and out-of-line memory ranges carried in the message. The caller must supply a buffer into which the header and body will be copied. The format of the received message is the same as when sent. If the message does not fit, it is destroyed. An option (MACH_RCV_LARGE) allows the caller to receive an error instead, along with the buffer size that would be needed, so that another receive operation can be attempted with an appropriately sized buffer.
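The difference between the default receive and a receive with the large-message option can be sketched as below. This is a toy Python model under stated assumptions: the `receive` function, its status strings, and the `rcv_large` flag are hypothetical stand-ins for the behavior the text attributes to MACH_RCV_LARGE, not the real call.

```python
def receive(queue, buffer_size, rcv_large=False):
    """Toy receive. Returns (status, payload or needed size)."""
    if not queue:
        return "empty", None
    size = len(queue[0])
    if size <= buffer_size:
        return "ok", queue.pop(0)
    if rcv_large:
        # Error return plus the size needed; the message is kept for a retry.
        return "too_large", size
    queue.pop(0)                      # default: the oversized message is destroyed
    return "too_large", None

# With the option set, the caller can retry with a big enough buffer.
q = [b"0123456789"]
status, needed = receive(q, 4, rcv_large=True)
retry_status, payload = receive(q, needed, rcv_large=True)

# Without it, the message is lost.
q2 = [b"0123456789"]
lost_status, lost_payload = receive(q2, 4)
```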

A received message can contain port rights and out-of-line memory. Received port rights and memory should be consumed or de-allocated in some fashion. Resource shortages that prevent the reception of a port right or out-of-line memory destroy that entity. The reception of port rights follows the same rules as the insertion of a port right by any other means as described under port name spaces. The port right descriptors describe the type of rights received.

For each out-of-line region transmitted, the kernel returns a descriptor in the received message that locates the memory and indicates whether the kernel has chosen (or has been directed by the sender) to send a physical or virtual copy. Sending a physical copy guarantees a deterministic behavior. Although backed by the default memory manager, the kernel's virtual copy optimizations might use the page images already in memory or might fetch page images from the memory manager that backed the sender's memory. Although this can subject the receiving task to an arbitrary memory manager, it can yield significant performance improvement over the direct copying of large amounts of memory. The optimal case for a virtual copy is when the sender has used the de-allocate option. See Virtual Memory Management for additional information.
There are two options for the reception of an out-of-line or port array memory region. In option one, the default case, the received region is dynamically allocated as temporary memory in the receiver's address space, as if by vm_allocate. If the kernel transmitted a virtual copy, the received data appears in allocated space at the same alignment within a page as that of the sender. In all other cases the allocated data starts at the beginning of a page boundary. Under no circumstances does the reception of a message with this option stall the receiver by inadvertently referencing the virtual copy itself. In option two, the alternate out-of-line reception mode causes the out-of-line region to overwrite the bytes in a specified area as if by vm_write. If the region does not fit in the specified area, the receiver sees an error. Under various circumstances, depending on the method of data transmission and data alignments, this option might perform page manipulations or actually copy data. This operation might possibly interact with the memory managers backing any virtual copies and arbitrarily stall the receiver.

The choice of the two reception modes depends on the MACH_RCV_OVERWRITE option. When not set, all received regions are dynamically allocated. If set, the receive buffer is considered to describe a "scatter" descriptor list, of the same form as a descriptor list in a sent message. The kernel scans this list to determine what to do with each received region. If not enough descriptors are supplied to handle the sent number of regions, any additional regions are dynamically allocated. The kernel looks at each out-of-line descriptor to determine the disposition of the corresponding region. A "sending" option of virtual copy implies dynamic allocation; an option of physical copy implies overwrite, with the address and size fields of the descriptor used to locate the region to be overwritten. An insufficient size causes an error return.
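The scatter-list rules above amount to a per-region decision. The Python sketch below illustrates it under stated assumptions: `plan_reception`, the dictionary-based descriptors, and the disposition labels are hypothetical and exist only to make the rules concrete.

```python
def plan_reception(sent_region_sizes, scatter_list=None):
    """Return one disposition per region: 'allocate', ('overwrite', address), or 'error'."""
    plan = []
    scatter_list = scatter_list or []      # option not set: no descriptors at all
    for i, size in enumerate(sent_region_sizes):
        if i >= len(scatter_list):
            plan.append("allocate")        # not enough descriptors supplied
            continue
        desc = scatter_list[i]
        if desc["copy"] == "virtual":
            plan.append("allocate")        # virtual copy implies dynamic allocation
        elif desc["size"] < size:
            plan.append("error")           # insufficient size in the descriptor
        else:
            plan.append(("overwrite", desc["address"]))
    return plan

sizes = [4096, 8192, 100, 50]
scatter = [
    {"copy": "physical", "address": 0x1000, "size": 4096},
    {"copy": "virtual"},
    {"copy": "physical", "address": 0x9000, "size": 64},
]
plan = plan_reception(sizes, scatter)
default_plan = plan_reception(sizes)       # overwrite option not set
```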

The receive operation is also a two-step process: de-queue the message and then make a copy in the receiver. The first step might fail because the specified receive time-out value was exceeded or because the receive was aborted (thread_abort). These situations do not affect the message that would have been received. Most failures occur during the second step, causing the message to be destroyed.

There is a notification that can be requested as a result of a mach_msg call. The notification is not generated by mach_msg, but is requested by the MACH_RCV_NOTIFY option. This option causes the reply port right that is received to automatically have a dead name notification requested for it as if by mach_port_request_notification. This option is an optimization for a certain class of RPC interactions. The dead name notification on the reply port name allows the receiver of the message to be informed in a timely manner of the death of the requesting client. However, since the reply right is typically a send-once right, sending the reply destroys the right and generates a port-deleted notification instead. An optimization to cancel this notification is provided by the MACH_SEND_CANCEL option to mach_msg. Message operations are atomic with respect to the manipulation of port rights in message headers.

Section 5. Virtual Memory Management

The Microkernel's virtual memory design layers the virtual memory system into machine-dependent and machine-independent portions. The machine-dependent portion provides a simple interface for validating, invalidating, and setting the access rights for pages of virtual memory, thereby maintaining the hardware address maps. The machine-independent portion provides support for logical address maps (mapping a virtual address space), memory ranges within this map, and the interface to the backing storage (memory objects) for these ranges through the external memory management interface.

The virtual memory system is designed for uniform memory access multiprocessors of a moderate number of processors. Support for architectures providing non-uniform memory access or no remote memory access is currently being investigated. High performance is a feature of the microkernel virtual memory design. Much of this results from its efficient support of large, sparse address spaces, shared memory, and virtual copy memory optimizations.

Finally, the virtual memory system allows clients to provide the backing storage for memory ranges, thereby defining the semantics that apply to such ranges.

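Reference is made here to the above cited copending United States Patent Application by James M. Magee and Freeman L. Rawson, III, entitled "METHOD AND APPARATUS FOR MANAGEMENT OF MAPPED AND UNMAPPED REGIONS OF MEMORY IN A MICROKERNEL DATA PROCESSING SYSTEM", which is incorporated herein by reference.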

Figure 6 shows the virtual memory structures. Memory ranges may have the same backing abstract memory object, but possibly differing inheritance or protection attributes. One of the memory cache/abstract memory object pairs is shown in detail with two memory object representatives, representing read and read/write access, and the memory manager task. A reserved, but un-allocated region is not shown. The region would be marked with only the reservation flag.

Reference is made here to the above cited copending United States Patent Application by Guy G. Sotomayor, Jr., James M. Magee, and Freeman L. Rawson, III, entitled "METHOD AND APPARATUS FOR MANAGEMENT OF MAPPED AND UNMAPPED REGIONS OF MEMORY IN A MICROKERNEL DATA PROCESSING SYSTEM", which is incorporated herein by reference for its more detailed discussion of this topic.






Part B. Detailed Description of the Invention 

Figure 7 shows a functional block diagram of the host multiprocessor system 100, with the IPC subsystem 122 and the capability engine 300 managing interprocess communications between two tasks 210 and 210' with threads running on two processors 110 and 112. The data processing system can be a shared memory multiprocessing system, as is shown in Figure 7, or a uniprocessor system. The microkernel 120 operating system is loaded into the memory 102 of the data processing system 100. In accordance with the invention, the microkernel includes the capability engine module 300 that manages capabilities or rights to map regions of the memory 102. The capability engine 300 is included in the IPC subsystem 122, along with several other modules, including the call scheduler 232, the message passing library 220, the temporary data module 400, the control registration module 500, the asynchronous reply module 600, the transmission control separation module 700, the shared memory support module 800, and the fast path module 900. All of these modules provide services contributing to the interprocess communications operations of the IPC 122.

The microkernel 120 forms two task containers 210 and 210', two threads 248 and 248' that belong to the respective tasks, and a data object 242. The task (A) 210 is provided with a pointer 240 that maps the data object 242 into the task container 210. The port rights of the task container 210 enable it to use the pointer 240 to refer to the address space occupied by the data object and access the data contained therein. The data referenced by the pointer 240 is thus made available to the thread 248 during its execution. Thread 248 has the instructions it sequences through processed by the processor A, 110.

Task (B) 210' also has an associated thread 248' that sends instructions to the second processor B, 112. Initially, the task (B) port rights are not sufficient to enable it to map the data in the data object 242 into its address space so that it would be available to the thread 248' for execution.
To illustrate the operation of the IPC 122 and the capability engine 300, the example is provided of a user program requesting that Task(B) be given sufficient port rights to enable it to map the data object 242 into its address space, thereby enabling its thread 248' to have the data available for its operations with processor 112. This is accomplished by the message 244 from task 210 to the message passing library 220 of the IPC 122, providing the port rights and the pointer 240 to the data object 242. The message passing library 220 requests the capability engine to check the port rights of Task(A) to determine whether rights may be granted to Task(B) to access data object 242. Once the capability engine has determined this from a comparison of rights and other factors, it queues the call on the call scheduler 232, and message 246 is then issued, authorizing the capability by providing the pointer 240 and the port rights granted by the sending task (A). The message passing library 220 can keep custody of a copy of the port rights for both the sender task (A) and the receiver task (B) to facilitate future interprocess communications between the two task containers.

Recapping, the first task container 210 is formed by the microkernel 120 in the memory 102, having a set of attributes defining a first communication port and a first set of port rights, and having a pointer 240 to the memory object 242, the first set of port rights conferring on the first task container a capability to access the memory object 242. A second task container 210' is formed in the memory 102 having a set of attributes defining a second communication port and a second set of port rights.

In accordance with the invention, the capability engine 300 registers the first set of port rights for the first task container 210 and the second set of port rights for the second task container 210'. The capability engine then compares the first set of port rights and the second set of port rights to determine whether the second task 210' can be allowed to gain access to the memory object 242. There is a wide range of port rights that can be attributed to a task's port: various permission levels, security levels, priority levels, processor and other resource availability, and many others, limited only by the user's imagination. The capability engine 300 can then selectively enable a transfer of the pointer 240 and the first port rights from the first task container 210 to the second task container 210', to confer onto the second task container 210' the capability to access the memory object 242. In this manner, the capability engine 300 manages the interprocess communication that must take place between the many clients and servers in the Microkernel System 115.

The invention applies to uniprocessors, shared memory multiprocessors, and multiple computers in a distributed processor system. Figure 8 shows a functional block diagram of two host multiprocessor systems running in a distributed processing arrangement, with the IPC subsystem 122 and the capability engine 300 on each host processor managing interprocess communications between tasks with the exchange of messages between the two hosts over a communications link 250.

In Figure 8, the thread 248' of the host 100 sends instructions to be executed in the I/O adapter processor 108. The instructions sent by thread 248' for execution can include those necessary for the formulation of a message to be sent from I/O processor 108 to I/O processor 108' of the host 100'. Such a message can include the information about the port rights and the pointer 240 sent from the Task(A) 210 to Task(B) 210' with the assistance of the capability engine, as discussed above. The port rights and pointer 240 information are sent in the message over the communications link 250 to the I/O processor 108'. There, the thread 249 associated with task 211 executes in I/O processor 108' and transfers the information about the port rights and pointer 240 to the task 211. Another IPC transfer, similar to the first one described, can be carried out in the host 100' to transfer the port rights and pointer 240 originally held by Task(A) 210 in host 100, to task 211' in host 100'. The thread 249' belonging to task 211' in host 100' executes instructions in the processor 112' of host 100'. If the program executed by the thread 249' in the processor 112' includes a command to access the data in the data object 242 of host 100, the port rights and pointer 240 now mapped into or stored in the address space of task 211' enable the processor 112' to access the data object 242. The data from data object 242 can then be transferred over the communications link 250 and stored in the data object 243 in host 100', if desired. This is but one example of the role that the capability engine can play in facilitating interprocess communications either within its own memory of residence or alternately with processors having separate memories.

Section 1: Subsystem Level Interaction

10 The lPC i22 subsy^^ 

1.1 The Simple IPC 122:

As with all other sub-systems in the microkernel 120, it is necessary to define the interaction/interfaces between IPC 122 and its peers. This helps in that it isolates the activities of IPC 122, easing the job of formal definition. With this kind of formal definition it also becomes possible to replace whole subsystems, providing a powerful tool for system customization. In its simplest form, IPC 122 interacts with port, task, thread, and scheduling objects. It must also work with the zone subsystem for temporary local storage. Fig. 9 is a depiction of the transfer of a simple message and the high level interaction of IPC 122 with other subsystems.

High Level Outline of Activities

Message -> IPC 122 Subsystem
    -> ZONES: Get Temporary Storage
    -> PORT SUBSYSTEM: Capability Engine
        Translate Port Object (Task)
        Push Message to Port, including transfer of port rights
        Check Port for Receivers
        SCHEDULER: Schedule Receiver (Thread)

RECEIVER -> IPC 122 Subsystem
    -> PORT SUBSYSTEM: Capability Engine
        Go to sleep on Port
        Awake with message
    -> ZONES: Free temporary storage

In an effort to preserve the simplicity of the IPC 122 subsystem's interface to its peers, and to provide a primitive service upon which future developers may create richer message passing systems, we have preserved the notion that IPC 122 acts only on direct data and ports as primitive data types. The capability engine 300 embodies this primitive IPC 122. All interactions with ports must go through the capability engine 300. The capability engine 300 provides calls which create capabilities on address ranges, map capabilities to virtual address spaces, queue messages, queue receiver threads, and allow explicit dequeuing of receiver threads and messages. (Though the call to queue goes through the capability engine 300 interface, the queuing mechanism and policy do not belong to the capability engine 300.) The capability engine 300 can also act as a synchronous or asynchronous stand alone message passing system.

In the asynchronous case, a receiver thread call will return with a message if one is queued. Or, if a receiver has been blocked waiting for a message and a message comes in from a separate thread, the capability engine 300 will unblock the receive thread and return it with the message. The capability engine 300 may also act as a primitive SVC or synchronous interface. In this case, the capability engine 300 checks to see whether the receiving task has set up the resources necessary for the exchange. If it has, the capability engine 300 changes address space to match the owner of the target port's receive right, associates the receiver resources, and returns. If the resources are not available, the sender thread blocks. The execution entity will find itself returning from the call to the same place in the code and with the same activation environment (in most cases this means mostly the same kernel stack), but operating possibly in a different address space.

The queuing mechanism and scheduling policies are associated with the port object and are not specific to the capability engine 300. The specific scheduling and queuing policy the capability engine 300 will call may be altered on a port by port basis via calls to the capability engine 300. There are two separate sets of calls for send and receive on the capability engine 300. With one set, the content of the message field is opaque to the capability engine 300. With the other, the capability engine 300 translates all incoming capabilities contained in the message to rights in the receiver's space. The second set allows the capability engine 300 to act as a primitive, asynchronous, stand alone message passing utility.

1.1.1 Capabilities:

From the subsystem perspective the only actual data types dealt with by the IPC 122 subsystem are direct data and port rights. This may seem surprising until the usage of port rights as capabilities is explored. Port rights may be used as a means of passing around the right to map a section of a task's address space, or an extent memory region. Hence, a message passing system could be created which passed small amounts of data directly, but created ports as a means of transferring large amounts of data via the create capability and map capability calls. The ports become tokens, or handles, to describe a portion of the task's mapped address space. The method, though workable, suffers from some measure of inefficiency, as it would be necessary for a thread in a task to make a capability engine 300 call to return a capability for a specific region of its address space, pass this to the target task, and have a thread in the target task map the capability into the target task's virtual address space. Still, because of its simplicity and functionality, in short its fundamental nature, the capability provides a perfect primitive from which to build our message passing architecture.

A capability can be defined as a right to map a memory region. The region to be mapped may be part of another task's address space, in which case the capability is to be a shared memory resource, or it may be to an extent memory region. In the extent memory region case, the capability is the only handle on this memory region. Port rights may be send or send once. In this way, the originator of the shared region could either suppress unauthorized additional sharing of the region or could allow the task to which the share region has been presented to extend share region access to other address spaces through the port primitive copy_send. Share and send_once are options on the create_capability call. Though the send_once option is available on extent memory region as well as share region capabilities, it limits receiver usage without benefit. After a receiver receives and maps a capability, it is free to create a new capability on the same region of memory. The use of send instead of send_once rights is the basis for the multiple share capability and represents an alternative method for establishing binary share.
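The distinction between send and send_once capabilities can be made concrete with a toy registry. The Python sketch below is illustrative only: the `CapabilityEngine` class and its method signatures are hypothetical (only the names create_capability, map capability, and copy_send come from the text), and the rule that mapping consumes a send_once capability is an assumption made for the model.

```python
import itertools

class CapabilityEngine:
    """Toy registry of capabilities, each a right to map a memory region (hypothetical)."""
    def __init__(self):
        self._caps = {}
        self._ids = itertools.count(1)

    def create_capability(self, region, send_once=False):
        cap = next(self._ids)
        self._caps[cap] = {"region": region, "send_once": send_once}
        return cap

    def copy_send(self, cap):
        # Only a send right may be extended to further address spaces.
        entry = self._caps[cap]
        if entry["send_once"]:
            raise PermissionError("send_once capability cannot be copied")
        return self.create_capability(entry["region"])

    def map_capability(self, cap, address_space):
        entry = self._caps[cap]
        address_space.append(entry["region"])
        if entry["send_once"]:
            del self._caps[cap]       # assumption: a send_once right is used up
        return entry["region"]

engine = CapabilityEngine()
region = ("task_a", 0x4000, 0x2000)   # (owner, start address, length), illustrative
shared = engine.create_capability(region)
once = engine.create_capability(region, send_once=True)

space_b, space_c = [], []
engine.map_capability(shared, space_b)
engine.map_capability(engine.copy_send(shared), space_c)   # sharing extended to C

try:
    engine.copy_send(once)
    copy_blocked = False
except PermissionError:
    copy_blocked = True
```

With a send right the region ends up mapped in both receiving spaces; with send_once the originator keeps further sharing under its own control.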

1.2 The Message Passing Library:

Simple message passing as embodied in the capability engine 300 is fully functional, but does not provide the opportunity to take advantage of many performance optimizations available in traditional message passing models. With the proper level of function provided by the capability engine 300, a coresident message passing package can be placed in the supervisor space of a target task. This library would be able to make local calls to the capability engine 300 when necessary. The space in which the IPC 122 package resides is not part of the kernel space mapped into each task, but rather is a region of space mapped through a privileged utility on a task by task basis. A task through its personality, and subsequently the personality through the personality neutral services, would be allowed to allocate portions of a task's address space as supervisor mode library repositories. They would also be allowed to download trusted shared libraries into this space, which would appear local to the kernel services. In this way any number of customized message passing front ends might be available in a single system. The application would call the message passing library 220 rather than the capability engine 300 directly.

The capability transfer scheme may be powerful, but in traditional message passing it is unnecessarily expensive. The IPC 122 library will continue to support the capability as the underlying logical data type but, through a new data type, will avoid when possible the explicit creation and translation of capabilities. The new type, which is secondary in that it is strictly internal to IPC 122, will be referred to as the by-reference data type. The by-reference data type is also a secondary data type in the sense that its function can be described as combinations of capability subsystem and capability transfer calls. As an example, let us say task A wishes to send 2 pages of data to task B. Task A may call the capability engine 300 directly to get an explicit capability for the region of its address space corresponding to these two pages, pass this capability to task B through a primitive capability call, and task B may do a capability map call. Conversely, we may send the data using our by-reference data type and a direct data size field. As will be shown, in many message passing paradigms, no explicit capability will ever be created, the capability subsystem will not be called for the passing of the data associated with our by-reference pointer, and we will experience a healthy performance boost.

The by-reference data type is a pointer to a location in the user's address space and is accompanied by a direct data size variable or an implicit size dictated by the specific interface definition. (In the latter case the size would be transmitted to the kernel via the message control structure.) The treatment of by-reference regions will require IPC 122 to utilize the same sort of capability interfaces or performance driven variants that the receivers would otherwise have to call to map capabilities directly. The variants will arise out of the opportunity to skip the explicit message creation step altogether when the sender and receiver are both synchronized. These variants represent internal optimizations of the IPC 122 library which are transparent at the component level. The conditions under which synchronization is experienced and the opportunities created by it are explored later, but in general synchronization is present in all RPC cases, most send/receive IPC 122 cases, synchronous send IPC 122, and even in the general IPC 122 case whenever the receiver arrives at the message handoff point before the sender does. The juxtaposition of sender and receiver is therefore common, and the opportunity it provides for performance enhancement makes it an important consideration.

In order to make the by-reference data type exchange roughly equivalent to transfer of a capability, it is necessary to preserve the choice of mapping enjoyed by an explicit capability mapping call. This flexibility can have a significant impact on receiver resources, as a receiver may look at a capability after receiving a message and may decide not to map it. The new IPC 122 subsystem achieves this by separating the by-reference optimization on the sender's side from that of the receiver. Data sent by-reference may be received as a capability or as a mapped by-reference field. Conversely, data sent as a capability may be received as such or as a by-reference field. In order to match capability subsystem mapping functionality, it is necessary to allow the receiver to "allocate anywhere" and to choose a specific address at which to allocate. The IPC 122 subsystem grants this and more. It also allows the collection of by-reference regions into contiguously allocated areas (this mimics the sequential map of capabilities into contiguous regions), as well as their individual placement.

The IPC 122 subsystem must of course use capability engine 300 functions to create and manipulate memory object capabilities, but the decomposition of traditional IPC 122 into combinations of simpler components is a significant step forward in the formalization of subsystem level interaction and architecture. Strict formalization of the proof of FUNCTIONAL equivalence between the simple message passing model and the one including by-reference data types will fall out along the following lines. 1: Transition between by-reference and capability is complete (closed in the range/domain sense) and completely independent with respect to sender and receiver. 2: Capabilities are still provided, granting the full range of receive side flexibility. 3: No function exists in the mapping of by-reference data which is not available as a decomposed set of capability calls on a capability.

It should be noted that the new message passing paradigm as it is presently defined does not allow for the CONTROLLED placement of data into UNMAPPED portions of the task's address space, only MAPPED ones. UNMAPPED placement is supported through the simple model via a capability call on the target capability. There is currently no plan to include this option in the by-reference case, as it can be mimicked by first mapping the region that is to be the target of the incoming by-reference data. As was stated above, since the passing of capabilities is supported by the message passing facility, it is not necessary from a formal model perspective to mimic every vm_map combination. The cases supported for by-reference data make sense either from a pure performance perspective or in support of a traditional RPC or IPC 122 model.

Section 2: Buffer and Address Space Resource Distribution by-reference data sub-classes

Having created the by-reference data class, the determination of location and the moment of allocation become issues of extreme importance. In the case of capabilities, the application determines the location of a buffer by knowledge of the specific use of the target parameter in a target call. Such specific knowledge is of course unavailable to the IPC 122 subsystem. In order to communicate buffer disposition knowledge to the IPC 122 subsystem for use on by-reference parameters, the subsets of parameter types must be understood and formalized. Separate data classes which are proper subsets of the by-reference data class are determined by specific restrictions on location or allocation. Provisions are made with respect to the IPC 122 interface to allow receivers to communicate which of these new data classes a particular by-reference parameter belongs to. These subclasses are determined by the way in which data contained in the by-reference buffers are to be used. For example, stateless servers have no need for data associated with a call after the reply message has been returned. The temporary nature of their data makes it possible to re-use the associated allocated address space, hence the existence of the temporary data class. In another example, receivers which are acting as intermediaries or proxies may not need to access a data region and therefore have no need to map such a region into their space. They may wish to opt for the capability choice. New message passing models may bring to light further fundamental class additions to this group. However, new entries should be limited to collections which are determined by a specific repeatable treatment of data either by the receiver or the send/receive pair (as in shared memory) which sets apart a group of by-reference examples. This is the basis for all current subclasses outlined below. Functionality beyond that provided by capabilities and their explicit mapping should not be introduced into the message passing library 220. For cases where a new message passing paradigm involving performance optimizations based on roll in of additional function is required, a separate new library should be created. This library is free to borrow interface and execution path notions from the message passing library 220, but is not obligated to do so. Such a library would operate on top of the capability engine 300 and as a peer to the IPC 122 library. Message compatibility between the peers would not be assumed, and any efforts to maintain compatibility would have to be made by the new library architecture.






2.1 Temporary data 400:

In RPC (remote procedure call) transfers, many send/receive pair IPC 122 (Inter Process Communication) calls, and some apparently asynchronous IPC 122 calls, the receiver knows ahead of time that some of the by-reference data parameters that will be received will only be needed for a short time. The time is bounded by some transaction. The transaction may be known to the system (via RPC) or be known only to the application (when it is engaged in async IPC 122 transfers). Through the IPC 122 interface, the receiver makes the IPC 122 subsystem aware of the temporary nature of the data by setting a parameter level flag. The implementation of IPC 122 has placed some properties on Temporary data 400: it has been determined that it may start without regard to boundary, and that it will be concatenated together with other server temporary parameters in a region of memory provided by the receiver on a per instance basis. The receiver is fully expected to reuse the buffer for subsequent calls, though the exact nature of this re-use is left to the receiver.

2.2 Permanent Data:

This is the default class for by-reference data, i.e., it is not shared, temporary [...]. Items falling into this class have
no special usage constraints upon which the subsystem can base buffer disposition optimizations. As a result, without
specific instructions by the receiver (please see section 3.4.3 for details) [...]
convenient for future transferral or long term residence. Further [...]
the earlier CMU based message passing semantics. These semantics included the placement of data into previously
unmapped regions of the target address space. Default treatment of permanent data includes 1: starting each buffer on
a page boundary. This makes subsequent removal and transferral of regions through unmapping and remapping
possible. Sometimes considered a performance win, this method would be awkward if data appeared on non-page
boundaries and parts of other buffers shared parts of these pages. Further, the mapping and un-mapping resources would
be fraught with artifact and residue. 2: When the data region is not of page modulo, [...]
is not used by subsequent parameters. Again this is to facilitate future mapping and unmapping. 3: Permanent data
parameters are subject to the overwrite option. This provides compatibility with the earlier CMU message passing system
and gives a method for call specific (or more usually server specific in the case of demultiplexing servers) disposition
of individual parameters.

2.3 Shared Data

The shared data class requires specific setup and initialization by the sender and receiver. During the [...]
the sender must explicitly make a portion of its mapped address space available to the receiver as a shared region, and
the receiver must expressly accept this region as a shared one, and direct it to a portion of its own address space. Thus
a shared data region cannot enter a space without the [...] application associated with that space [...]
and conversely a region of a space cannot become [...]
threads. Shared data support is a fairly rich package which allows one task to signal another that an arbitrary [...]
a physically shared area is now available (filled with data, cleared of data, etc.). What separates the [...]
memory support 800 of IPC 122 from specific ad hoc use of shared memory space and semaphores is the support [...]
the paradigm in situations where the two parties do not [...]
yet still make use of the shared paradigm. [...]
but if these are found acceptable a non-local client or server is possible. Further, a formal language is established
for describing the portion of space which has [...]
memory techniques using special hardware can be utilized in a fashion which is transparent to the two application level
parties.

2.4 Server Allocated Resources

As its name implies, this buffer sub-class is specific to RPC. In RPC the entire transaction, send and receive, is
described in a single message control structure. Buffer regions which will be needed for data destined for the client are,
by default, brought into existence during the request (client message send). This is necessary to provide the expected
semantics for an important class of procedure calls (those in which the caller provides buffer space for data it is expecting
from the called procedure). For cases in which the buffer is to be provided by the server it is necessary to suppress
buffer allocation by the IPC 122 subsystem. To enable the simplest possible co-residency, we will want to suppress IPC
122 level buffer allocation through the use of the server_allocate option. Even if we were willing to accept the server
side always expecting a buffer and having the library routine for the local call create this buffer, there is still a performance
related reason for suppression. The server may already have a copy of the data the client wishes to see. Full support
of the server_allocate option means that the server is allowed to set the client sent parameter to point directly at this
data; this is obviously the method of choice in a local interaction. If the server were always required to accept an incoming
buffer, the local case would suffer. The intermediate library routine would be forced to allocate a buffer, and the server
would have to copy data from its permanent source into this buffer. A similar scenario occurs in the remote case and,
though it slows down the transaction [...]

2.5 Sender (server) Deallocate

The sender deallocate buffer subclass is present in IPC 122 and on the client side of RPC. It is characterized by a
wish on the part of the sender to deallocate the memory resource associated with a parameter after the associated data
has been communicated to the receiver. The existence of the deallocate option allows the IPC 122 subsystem user to
avoid an otherwise [...]

It can be argued that in most cases where the environment is such that the message passing is explicit the best
possible performance profile will be based on buffer reuse. Buffer re-use, however, is not always practical even where
message passing is [...] mapped into a space, worked upon and then sent on its way is probably well
served by IPC 122 sender deallocate.

In RPC, it is necessary to support the case where a caller is expecting the called procedure to deallocate a buffer
pointed to by one of the calling parameters. [...] availability of sender_dealloc, support of this behavior in the
remote case would require [...] by the client side stub upon return from [...] before returning
to the application. RPC also supports an analogous option on the server side dubbed server_dealloc. Server_dealloc
can be used on buffers associated with data the server is returning to the client, with buffers the server is receiving data
on, or buffers which serve both functions. In the server send case server_dealloc behavior is the mirror of send dealloc.
In the client send case, the server_dealloc function appears to operate like [...] server associated with a server_dealloc
follows the rules of permanent buffers; this makes it easier to manipulate on subsequent calls within the server. Further,
the buffer which is deallocated when the server makes its reply is the one associated with the reply data, not necessarily
the one allocated on [...]

2.5 Transmission Information:

Along with the normal header fields, the message passing model provides the opportunity to gather extended
information about the ongoing transfer through optional parameters. These optional [...] mostly direct data,
but where they are not, [...] are considered members of this temporary subclass. All direct data and by-reference pointers
follow the normal header information. Data within the extended fields [...] mostly direct communication with the [...]
subsystem. Where [...] the subsystem, information supplied by the other [...] the transaction may
influence the [...] are NDR and STATUS. (NDR is a description of [...]
hardware level [...]
details.) Other fields such as trailer may have whole sections given over to peer to peer communication between the
sender and receiver stubs.

The header, along with the direct data portion of the optional transmission control information, is sent as a
by-reference parameter on the IPC 122 call. Upon return from the call, the [...]
in the same [...] in the case [...] IPC 122 subsystem [...]
if the allocate [...] transmission [...] information is
sent as server [...] same buffer [...]
information appears [...]

Memory capabilities must be distinguished [...] other port rights because the IPC 122 subsystem must be able to
map them if the receiver wishes. Moreover, the IPC 122 subsystem [...] able to create memory capabilities from
by-reference descriptions. IPC 122 must support the case of a client sending by-reference [...] and a receiver requesting
the information be delivered as a capability. Memory capabilities may represent snapshot transferral of data or a memory
buffer to be shared between the sender and receiver. Passing of a shared capability to a server does not require the
server to make provisions ahead of time. The server will be able to detect the share setting of the capability and will
take whatever action it deems fit with respect to [...]






Message Passing Outline, The Major Subcomponents

3.1 Outline of Execution Structure:

The capability engine 300 generates primitive SVC and strict asynchronous message passing interfaces. While it
would be straightforward to emulate RPC, more complex IPC 122, passive servers, etc., on top of the capability engine
300's primitive message service, there would be a significant penalty to pay in performance. The IPC 122 library, rather,
uses the capability engine 300 opaque message transfer options. Further, in RPC and two way IPC 122, the message
passing library 220 chooses to remove the blocked SVC call, removing the need to [...] receiver,
dequeuing it and then doing a thread handoff. If a receiver is not waiting, the sender [...]
be through the capability [...] sender queue; no explicit message [...]
allows most by-reference transfer to occur without the creation of an explicit capability. Such a [...] occur
without any consultation with the capability engine 300. Two [...] will serve to illustrate synchronous [...]
asynchronous message transfer [...] the message passing library 220 and capability engine 300. When the receive
thread is the first to arrive at the message port, asynchronous [...] behaves as the synchronous [...].
[...] synchronous case, the execution path for the reply is not outlined explicitly. This is because it is an instance of
example 2 of the async case (send with receiver waiting).

Example 1 in Fig. 10 outlines the path [...] is sent to a port upon which there are no waiting receivers.
In the case of an RPC the application [...] (not shown) the stub emulates local procedure call [...]
caller. The stub assembles the pieces of a call specific message, and traps to the message passing library [...]
message passing library 220, which exists in [...]
SVC. In our example above, no one is waiting to receive the message. [...] message passing library 220 [...] a
continuation (if [...]) and sends it on the SVC call. There is also an option [...]
thread handoff and not on thread block. The capability engine 300, based on these options, blocks the sender with or
without continuation.

In Example 2 of Fig. 11 we again have an incoming send message but this time the capability engine 300's check
for waiting servers meets with success. The proper server is targeted for thread handoff. The capability engine 300 now
has both the sender and receiver and can proceed with thread handoff. After returning from thread handoff the message
passing library 220 can proceed to transfer the message directly. It should be noted that at no time has the message
format or content been exposed to the capability engine 300. This gives the message passing library 220 full flexibility
in choosing message format AND content. Future libraries could [...] of buffer allocation or data transformation
schemes deemed [...] passing library 220 [...]
data directly between [...] and receiver. The only kinds of data which require calls to the capability [...]
are port and capability transformations. This includes the direct [...] of capabilities [...] but also the mapping
of a capability (the [...] capability [...]) or the unmapping of one (the [...]
incoming by-reference buffer be received as [...]). Scheduling again takes place without the intervention of the
capability engine 300. The sender or client is already blocked and [...] require [...] capability engine
300 unless the client is to wait upon an explicit reply port. (If the server is not accepting anonymous reply, not guaranteeing
the reply will be returned by the entity now receiving the message.) The scheduler is called if the server is to run with
the receiver's scheduling properties.

In Example 3 of Fig. 12, we see things from the receive side. A receiver arrives, only to find there are no waiting
senders. The receiver blocks through [...] engine 300. The library of course decides upon the disposition of
the block, i.e., whether or not to block on a continuation. It is the library's responsibility to guarantee against the arrival
of a send while the receiver is in the process of blocking, or to check once more for senders after [...]

Example 4 of Fig. 13 is identical to example 2 except it is from the receiver's vantage point. At user level
(non-supervisor mode) a message is either cobbled together directly (IPC 122 send/receive) or created in a server loop
designed to assist the target end point of an RPC (emulate a local call to the callee in a procedure call transaction). In
either case a trap is made to the supervisor level message passing library 220. After the call to the capability
engine 300 to get a sender succeeds, the message passing library 220 finds itself with a sender and a receiver. It
proceeds to [...]
called if the thread is meant to run with the client's scheduling properties. The client will block again. As in example 2,
the capability engine 300 will only be called if the reply port is explicit.

As with the synchronous case the message passing library 220 begins by checking for a receiver. When one is not
found, however, the asynchronous nature of the interface in Fig. 14 requires the library to engage in the expensive
business of creating a formal message. Direct data is copied without transformation, port rights are generated according
to their dispositions and placed in the message, capabilities are created for all by-reference parameters and capability
rights are pushed according to their dispositions. All of these activities require the direct support of the capability engine
300. The message is queued through a call to the capability engine 300.






In this case of Fig. 15, the asynchronous model behaves similarly to the synchronous one. Since it is not necessary
to create an explicit message, the performance of the combination of examples 2 and 3 of the asynchronous case will
be significantly better than that represented by the combination of asynchronous examples 1 and 4. The only difference
between the asynchronous and synchronous example 2 cases is the lack of a pending reply in the async case. Without
the need to wait on a reply, the sender is [...]

Example 3 of Fig. 16 is identical to example 3 in the synchronous case. Experience of example 3 behavior determines
that the sender will experience example 2 behavior upon send. This means that performance conscious users of the
asynchronous model [...]

Example 4 of Fig. 17 calls the port [...] send queue and recovers the appropriate message based on queue
function specific criteria. The message is made up of direct data [...] transformation.
It may also contain ports and capabilities. The ports and capabilities must be translated via calls to the capability
engine 300. Small by-reference fields may be masqueraded by the message passing library 220 to avoid the overhead
of capability creation at the time of explicit message creation. The receiver may still choose to receive the field as a
capability, but if it is received as a by-reference buffer, the capability engine 300 will not have to be called.
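
The send-side branching described in the synchronous and asynchronous examples can be summarized in a short Python sketch. This is illustrative only: the `Port` class, its three queues, and the returned outcome strings are hypothetical names chosen here and do not come from the specification.

```python
from collections import deque


class Port:
    """Hypothetical stand-in for a message port and its queues."""

    def __init__(self):
        self.waiting_receivers = deque()
        self.blocked_senders = deque()
        self.queued_messages = deque()


def send(port, sender, payload, synchronous):
    """Return (outcome, receiver) for one send attempt on a port."""
    if port.waiting_receivers:
        # Example 2 (sync or async): a receiver is waiting, so the
        # message is transferred directly after a thread handoff.
        return ("handoff", port.waiting_receivers.popleft())
    if synchronous:
        # Synchronous example 1: block the sender; no explicit
        # message needs to be created.
        port.blocked_senders.append((sender, payload))
        return ("sender_blocked", None)
    # Asynchronous example 1: the expensive path, an explicit
    # message is created and queued.
    port.queued_messages.append(dict(payload))
    return ("message_queued", None)
```

The sketch shows why the text favors having the receiver arrive first: only the last branch pays for explicit message creation.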

3.2 Message Structure: 

The message structure associated with the message passing library 220 contains the functional elements that one
would expect in any message passing interface. It of course has provision for the buffer disposition options outlined
above as well as the primitive data types. However, the message structure differs from many other systems in that the
fields associated with overall transmission control, those residing in the header, do not have to be contiguous with the
rest of the message structure. The message structure identifies 4 separable entities. The header points to 2 of these
and the message parameters themselves point to the last. The four entities are 1: the header 2: the message control
structure (contains information about the specific parameters associated with a specific call) 3: the message (the direct
data, by-reference pointers, ports, and explicit capabilities associated with a call) 4: the by-reference regions.

There is no restriction on allowing the regions to be contiguous. It is possible that there is some small performance
advantage in having them all in a contiguous form but if they are formed at different times, it is not necessary to recopy
them just to make them contiguous.
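
As a rough illustration of the four separable entities, the following Python sketch models each as its own object, so none of them needs to be contiguous with the others. All class and field names are hypothetical and chosen only for this example; the specification does not define a concrete layout here.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ByRefRegion:
    """Entity 4: a by-reference region, pointed to by a parameter."""
    data: bytes


@dataclass
class MessageControl:
    """Entity 2: one description per parameter of the call."""
    descriptors: List[str]


@dataclass
class Message:
    """Entity 3: direct data and by-reference pointers."""
    params: List[Union[int, ByRefRegion]]


@dataclass
class Header:
    """Entity 1: points to the control structure and the message."""
    control: MessageControl
    message: Message


def regions_of(header: Header) -> List[ByRefRegion]:
    """Follow the message parameters to the by-reference regions."""
    return [p for p in header.message.params if isinstance(p, ByRefRegion)]
```

Because the header only holds references, the control structure, the message, and the regions can each be built at a different time and place without recopying.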

The message parameter information [...] parameter in the message buffer. The description
fully defines the parameter data type, size (either directly as a fixed size or indirectly through a [...] count
parameter) and disposition of the [...]

Throughout the design phase of the message passing library 220, one must be mindful that the performance
exhibited by an implementation can be greatly influenced by the layout of associated data structures. This has influenced
the layout of the bits of the parameter descriptors of the message control structure as well as the separation of control,
message, and [...] level. It was further realized that in many of the important modes
of use, the information associated with the substructures was generated at different times and in different places. It was
only good programming [...] in the overall design of the message structure. Upon further analysis this
nod to good form turned out to yield significant performance gain.

3.2.1 The Separation of [...]

[...] aware of [...] and direct [...] message
passing the separation [...] passed is large, the wide variety of by-reference
data types is available to avoid excess [...] a large and interesting set of uses of message
passing in which the endpoint application calls a proxy library routine to do the actual message passing trap. These
uses include the [...] some interesting IPC 122 cases. If the endpoint application makes a copy
of a rather large parameter in the act of calling a proxy user level message passing library 220 routine, it would be
nice to be able to use that copy instead of recopying the data just to make it contiguous with the message header. Again,
if the message is small or the system stack conventions are not known to the proxy service, the proxy can always fall back
to message data copy.

By knowing the address of the parameters sent to the proxy and assuming the parameters are contiguous and in
a known order, the proxy may pass the address of the parameter block as the address of the message head as shown
in Fig. 19. In support of this, an extra type was added to the by-reference sub-types outlined earlier, the pointer to a
by-reference parameter. This is very popular as a means of returning an altered pointer to a data structure in languages
such as "C" which view the direct altering of an argument by a called function within the scope of its caller to be a violation
of scope rules.






3.2.2 The Separation of Static Message Control Information: 

Of the pieces of the message which have been isolated into sub-structures, among the most important is the
message control structure. Performance gains realized by the separation of the message control information promise
to be significant across the whole spectrum of supported message passing. It should be emphasized that the information
contained in the message control structure completely defines the transaction from the endpoint's view. (Receive side
override, which can be viewed as an exception to this, will be discussed in section 3.2.4.3.) The information pertaining to
message invocation contained in the header is a dialogue between the caller of the message service and the message
passing library 220. Though information [...] the two endpoints [...]
transmission control fields, setting an option on the send side does not require the setting [...] on the receive
side and vice versa. Fig. 2[...] is a high level sketch of the message control structure.

Primary descriptors have a one to one, mapped and onto relationship with the parameters of the message, the first
corresponding to the first parameter, the second to the second and so on. Further, the primary descriptors are required
to be contiguous. In this way it is possible to find the descriptor corresponding to the 3rd parameter by simple offset
from the beginning of the descriptor section of the message control structure. Primary descriptors may not be
enough to carry all the state information necessary for a particular parameter. If this is the case, a field of the primary
descriptor points to an offset within the message control structure which corresponds to the start of a secondary
descriptor. Secondary descriptors come in different sizes, appear after the primary descriptors and in no particular order.
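
The offset arithmetic for primary and secondary descriptors can be sketched as follows. This is a minimal illustration under assumed conventions: the 4-byte primary descriptor (a type code plus a secondary offset, with 0 meaning "no secondary descriptor") and the length-prefixed secondary descriptor are hypothetical layouts, not ones given in the specification.

```python
import struct

# Hypothetical fixed-size primary descriptor: (type code, secondary offset).
PRIMARY = struct.Struct("<HH")


def build_control(params):
    """params: list of (type_code, secondary_bytes_or_None)."""
    base = PRIMARY.size * len(params)   # secondaries start after primaries
    primaries, tail = b"", b""
    for type_code, secondary in params:
        offset = 0                      # 0 marks "no secondary descriptor"
        if secondary is not None:
            offset = base + len(tail)
            tail += struct.pack("<H", len(secondary)) + secondary
        primaries += PRIMARY.pack(type_code, offset)
    return primaries + tail


def primary(control, index):
    """Find the n-th primary descriptor by simple offset."""
    return PRIMARY.unpack_from(control, index * PRIMARY.size)


def secondary(control, index):
    """Follow the primary descriptor's offset field, if it is set."""
    _, offset = primary(control, index)
    if offset == 0:
        return None
    (length,) = struct.unpack_from("<H", control, offset)
    return control[offset + 2 : offset + 2 + length]
```

Since the primaries are contiguous and fixed-size, locating the descriptor for the 3rd parameter is a single multiplication, while variable-size state lives behind the offset field.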
Full definition of the message format within a compressed and carefully thought out structure is significant from an
interface definition perspective. Servers can check the match between the message format they expect and the one
the sender has provided with a simple binary comparison check of the message control structure sent by the client.
When a match is found, the server will be guaranteed that pointers within the message will be pointers, ports will be
ports, etc. It does not guarantee semantic meaning to the associated data of course, but it does mean the server is
protected against random pointers and random values for port rights. The server is guaranteed of this because the
message control structure (and server provided overrides) hold the sole determination criteria for the message parameter
format.
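
A minimal sketch of the binary comparison check, assuming the message control structure is available as a byte string. The type codes and the `encode_control` helper are hypothetical and exist only to make the example self-contained.

```python
import struct

# Hypothetical parameter type codes.
DIRECT, POINTER, PORT = 0, 1, 2


def encode_control(type_codes):
    """Pack one 16-bit type code per parameter (illustrative layout)."""
    return struct.pack(f"<{len(type_codes)}H", *type_codes)


# The format this server expects: direct data, a pointer, then a port.
EXPECTED = encode_control([DIRECT, POINTER, PORT])


def accept(control: bytes) -> bool:
    """Accept only messages whose control structure matches bit for bit."""
    return control == EXPECTED
```

The check confirms only that each parameter has the expected kind; as the text notes, it says nothing about the semantic meaning of the data.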

What further makes the message control structure special is that it is defined prior to execution. To avoid
unnecessary work, the proxy may point the header based message control structure pointer to a fixed copy of the entity in
BSS or other storage, thus avoiding the need to create the structure in temporary local storage each time this proxy
function is invoked. This prior definition of the message control structure is important for another reason [...]
can be set up which allow the transfer of messages based on pre-screened message control structures [...]
need only supply a registration label. This will avoid not only the sending of message control information but also the
runtime comparison between sender and receiver for non trusted sends.
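
One way to picture the registration label idea is a table of pre-screened control structures keyed by a small integer. The sketch below is an assumption made for illustration: the `PrescreenedFormats` class, its method names, and the integer labels are hypothetical and are not an interface described in the specification.

```python
class PrescreenedFormats:
    """Hypothetical table of control structures screened ahead of time."""

    def __init__(self):
        self._by_label = {}
        self._next_label = 1

    def register(self, control: bytes) -> int:
        """Store a control structure once and hand back its label."""
        label = self._next_label
        self._next_label += 1
        self._by_label[label] = control
        return label

    def lookup(self, label: int):
        """Resolve a label sent in place of the control structure."""
        return self._by_label.get(label)
```

After registration the sender transmits only the label, so neither the control structure nor a per-send comparison is needed.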

The separation of the Message Control Structure is helpful in another way: it makes it easier to leave it off. If a message
only carries direct data, there is no need for [...]
have been dubbed "SIMPLE" messages. Simple messages can be [...]
simple message may still require a message control structure if the server wishes to test it for compatible format. This
should be a very limited case however. If the server is expecting a particular message or recognizes a group of message
id's, a simple message of the wrong format behaves no differently than one which simply contains garbage data. The
only case where a server might need a message control structure is on messages containing variable simple data format
not distinguished by message id. Unless the data is self-defining, the receiver would have to look at the message control
structure to find the parameter boundaries. In the case of simple messages, the sender is not required to supply a
message control [...] passing library 220 [...]
one to the receiver [...] transfers [...] which the receiver needs the message control structure, the simple option should be
turned off.

The message control structure has been set up to define EVERY parameter being sent. This is important for receivers
which accept messages which are not pre-defined. Without the definition of every parameter, the server would not be
able to parse the incoming message. There have been efforts to improve the performance of message passing by
declaring all direct data to be a single field. Experimentation with prototype code on the new message control structure
has shown that parsing through direct data fields has almost no performance impact. (The loop to parse direct data
consists of checking a bit in the parameter disposition field, and upon realizing that it is direct data, adding a count field
value to the offset within the message data structure to point at the next parameter. The bump to the next parameter
descriptor and the loop check are the only additional actions.) Even with this evidence there are some who might still
argue that the overhead is unnecessary. In the unlikely event that some message might benefit from the coalescing of
direct data, such coalescing can be done at the proxy library level. The proxy can re-arrange the message fields, putting
all the direct data fields together and labeling them as one field for the transfer. In this way the message passing library
220 can preserve within its model the convenience of identifying all parameters without any compromise of performance,
real or perceived.
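
The direct-data parsing loop described in the parenthetical can be sketched in a few lines. The bit position of the direct-data flag and the fixed pointer size used for non-direct parameters are assumptions made for this example; the specification states only that a disposition bit is checked and a count is added to the offset.

```python
from typing import List, Tuple

DIRECT_BIT = 0x1   # hypothetical position of the "direct data" flag


def parameter_offsets(descriptors: List[Tuple[int, int]],
                      pointer_size: int = 8) -> List[int]:
    """Return each parameter's offset within the message data.

    descriptors holds one (disposition, count) pair per parameter.
    """
    offsets = []
    offset = 0
    for disposition, count in descriptors:
        offsets.append(offset)
        if disposition & DIRECT_BIT:
            offset += count          # direct data: skip 'count' bytes
        else:
            offset += pointer_size   # assumed: a pointer-sized slot
    return offsets
```

Per direct-data parameter the loop does one bit test and one addition, which is consistent with the claim that parsing direct fields individually costs very little.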






3.2.3 The Separation of Transmission Control Information:

The transmission control information subsection of the message structure consists of the header and an optional
group of transmission variables. The information contained in this subsection is characterized by two things: First, it is
information that is presented directly by the caller of the message passing library 220. Second, whatever its origin or
final use, the fields and most often the data are parsed and interpreted by the caller of the message passing library 220
and the library itself. The motivation for the separation of message control information from the rest of the message is
the same as that for the separation of message data and static control information. The collection of fields found in the
transmission portion of the message is created and manipulated at the same time and by the same routine regardless
of the message [...] 122, RPC by-proxy). This guarantees that there will not be unnecessary copying.
Further, the strictly enforced point of interaction for the transmission section is the message passing library 220. This
preserves the very important role of the message control section as the one place to look for the sender/receiver dialogue
determining message format, not just message buffer format. A sender can only influence the format of a message
delivered to a receiver through the message control structure. The format of the message buffer is completely determined
by the message control structure and the overwrite buffer. (The overwrite buffer allows the receiver to exercise local
override on final disposition of capabilities and by-reference regions. Please see section 3.2.4.3 for details.) The format
of the header returned from a call to the message passing library 220 is determined by the options chosen in the
transmission control section at the time of the call. Hence, a receiver will be returned a message whose header format reflects
the receiver's transmission control section requests at the time the receive call was made. The message buffer format,
on the other hand, will reflect the data sent by the sender and whatever the sender's message control structure dictated,
except where by-reference and capability disposition has been influenced by server use of the overwrite buffer.

If the caller wishes to influence some specific aspect of the message transfer, it interacts with the library
through the transmission control section. Data can be passed from sender to receiver through the transmission section
but this data is interpreted by the message passing library 220. In the case of sender to receiver communication, the
interfaces are defined such that there is always a default behavior which is acceptable from a format perspective
regardless of the remote party's actions. In this way the remote party's choice of transmission control options has no
influence on the format of the local message. This is critical in maintaining the message control structure as the only
source for message format determination where a sender may influence the format of a message received. STATUS,
Trailer_Request, and [...]

A simple example of [...] between the message passing library 220 and a receiver can be shown with
NDR request. When a message is sent the sender has the option of including an NDR_Supply parameter. This is only
done if the primitive data [...] based do not match the host machine. If the
NDR_Request option is active when the message is delivered, the message passing library 220 will by default pass the
NDR information of the host machine. If the sender opted for NDR_Supply then the message passing library 220 will
pass the information offered by [...]
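
The NDR default described above amounts to a small selection rule, sketched here in Python. The function name, the string values, and the use of `None` for "no NDR information delivered" are hypothetical choices for illustration.

```python
from typing import Optional

HOST_NDR = "host-format"   # hypothetical description of the host machine


def ndr_to_deliver(ndr_request: bool,
                   ndr_supply: Optional[str],
                   host_ndr: str = HOST_NDR) -> Optional[str]:
    """Pick the NDR information handed to the receiver, if any."""
    if not ndr_request:
        return None            # receiver did not ask for NDR information
    if ndr_supply is not None:
        return ndr_supply      # sender opted for NDR_Supply
    return host_ndr            # default: the host machine's NDR
```

Whatever the sender chooses, the receiver that asked for NDR always gets a well-formed answer, which is the "acceptable default" property the preceding paragraphs rely on.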

Another important capability of the transmission control system is its ability to pass uninterpreted data between a
sender and a receiver. Such data can be passed from proxy to proxy without altering the endpoint message buffer via
the trailer. Certain fixed fields are present in the trailer including sequence number and security token; beyond this is
an open data field. If the size of the trailer is fixed by prior agreement, and the sender sends data, and the receiver
requests it, the trailer may [...] header area. If the amount [...] in the trailer [...]
call to call, the receiver may wish to request the by-reference [...] direct data version of
trailer request includes a count parameter. The count sent by the receiver is the maximum that will be received; more
incoming data is truncated. This count variable is changed by the message passing library 220 to reflect the amount of
data sent [...] up to the space provided in the temporary data 400 buffer the
receiver must use [...] in either case should the sender not provide a trailer, the trailer received
will only [...]. If none are requested, the size could be zero. The area beyond the defined
trailer fields is [...] receiver just as it was sent. The method the receiver
decides to obtain the trailer information by has no effect on the sender. The sender is free to send the information either
directly or by-reference.
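
The count handling for the trailer can be illustrated with a short sketch. It covers only the truncation rule stated in the text (the receiver's count is a maximum, and the count is rewritten to the amount actually delivered); the fixed trailer fields are left out, and the function name is hypothetical.

```python
from typing import Tuple


def deliver_trailer(sent: bytes, max_count: int) -> Tuple[bytes, int]:
    """Return the trailer bytes delivered and the updated count."""
    data = sent[:max_count]    # incoming data beyond the count is truncated
    return data, len(data)     # count now reflects what was delivered
```

For example, a receiver that asks for at most 4 bytes gets 4 bytes of a 6-byte trailer, and a count of 2 when only 2 bytes were sent.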

When a caller of the message passing library 220 prepares a call, it sets up a header structure. This header structure
sits in a buffer which must be large enough to accept not just returned header information but also direct data associated
with the transmission options the caller has requested. This also includes room for any direct data (by-reference objects)
requested. This implies that by-reference regions associated with transmission control parameters are considered server
temporary. As will be detailed later, when a server in an RPC or the target of a 2-way IPC 122 calls the message passing
library 220, the header sits in a buffer which must be prepared to accept not only all of the transmission control information
as outlined above, but also the server temporary data 400. The format of the returned buffer is: header at the top, followed
by direct optional control information, followed by server temporary fields, including those associated with the
transmission control information.
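
The ordering of the returned buffer can be sketched as a simple offset calculation. This is an illustration only: the helper name and the idea of passing the three region sizes as plain integers are assumptions made for the example, not part of the described interface.

```python
def returned_buffer_layout(header_size, optional_size, temporary_size):
    """Compute region offsets for a returned buffer (hypothetical helper).

    The order follows the text: header at the top, then direct optional
    control information, then server temporary fields.
    """
    for size in (header_size, optional_size, temporary_size):
        if size < 0:
            raise ValueError("region sizes must be non-negative")
    offsets = {
        "header": 0,
        "optional_control": header_size,
        "server_temporary": header_size + optional_size,
    }
    total = header_size + optional_size + temporary_size  # minimum buffer size
    return offsets, total
```

The returned total is the smallest buffer a caller could supply under these assumed sizes.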






Figure 21 is a diagram outlining the transmission control structure. The fixed portion of the header determines the
kind of message passing, i.e., Send, Receive, Send/Receive, RPC, IPC 122, and the kind of reply port in 2-way messages.
The optional portion of the header is determined by the optional transmission flags field. Each of the optional fields
corresponds to a bit. When present, these fields must appear in sequence. For optional transmission fields which are
by-reference, a sub-field of the optional field entry for that parameter will be used to point into the temporary buffer
address area. Another sub-field will describe the size of the [...]
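
The rule that each optional field corresponds to a flag bit, and that present fields appear in sequence, can be sketched as follows. The field names and their bit assignments are invented for the example; the text does not list them.

```python
# Hypothetical optional fields, listed in the order of their flag bits
# (bit 0 first). These names are illustrative assumptions only.
OPTIONAL_FIELDS = ["field_a", "field_b", "field_c"]


def present_optional_fields(flags: int) -> list:
    """Return the optional fields selected by `flags`, in header order."""
    return [
        name
        for bit, name in enumerate(OPTIONAL_FIELDS)
        if flags & (1 << bit)          # a set bit means the field is present
    ]
```

Because the fields must appear in sequence, a parser needs only the flags word to know which entries follow the fixed header and in what order.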

3.2.4 The Relationship Between Sender and Receiver

All message passing systems must deal with the problem of coordinating sender and receiver message format and
identification. Some [...] assuming that the question of message format is settled by the sender and receiver outside
of the message passing paradigm. Others pass partially or fully defined messages that the receiver must parse to
determine what it is and whether or not it should be accepted. Both points of view have their advantages: in an embedded
system sending fully trusted messages, it is hardly necessary to burden the processor with generic message [...]
On the other hand, in the general message [...] is a real need for non-trusted communication
between senders and receivers [...] must verify message format. General message passing also makes
use of generic receive servers which parse a message to determine its [...] separation of message control
information, the [...]

Except in the case of simple messages, the sender must provide a message control structure when a send message
call is made to the message library 220. This convention is absolutely necessary in the case of asynchronous
messages, where server input simply may not be available. Although not absolutely necessary in the synchronous cases,
it does provide a discipline. By requiring the supply of a message control structure from the sender, the receiver always
has the option of checking the incoming message format. Further, the number of nonsense messages delivered from
non-trusted clients is likely to be lower. If the client sent a message and relied on a server message control structure to
parse it, some percentage of the time an incorrect message would get through based on the ability to incorrectly but
undetectably interpret the client message parameters. A non-trusted client would then be sending garbage data to a
server. If the client is required to send a message control structure, the server [...] non-trusted client message
control structure, avoiding the receipt of garbage data. (The client can always deliberately send garbage data, of course.)
Having the sender supply a message control structure also reduces the possibility of unintentional damage to the client.
If the client were to send a message to the wrong port in the server message [...] paradigm, and that
message were to unintentionally succeed, the client might lose large [...] unmapping and overwriting [...], i.e., a
client may send a message to a server [...] there are two direct parameters [...] believes the first
parameter is a by-reference and that, further, the associated buffer is to be removed after the client send. Now if the data
in the client send just happens to look like a valid address, the client will unintentionally unmap a portion of its address
space.

Fig. 22 shows 2 examples of message control structure usage with our accepted convention, client supplied message
control information.

The message passing library 220 consults the sender supplied message control structure to translate all non-direct
data parameters. The server, however, is expecting messages of only one format, or in the case of a demultiplexing
server, messages whose format is determined by the message id. The server, therefore, does not request the message
control structure and acts on its assumptions. Such a server could be damaged by a client either intentionally or
unintentionally sending a message of the wrong format.

With the receipt of the client's message [...] to check the format of the incoming
message against [...] If the server is [...] checked first to determine which
amongst a set of [...] particular incoming entity should [...]
message control [...] consulted in order to parse the message [...] as shown in Fig. 23. This last scenario is
most likely when the server is acting as an intermediary for another server. The use of the message passing interface
to implement a communications server can make a good example of the power of the message passing library 220. For
two communicating nodes the integrated shared memory data types can be used. If the nodes share common memory
(or hardware supported mirrored memory) the transfer can take place without overt memory copy. If not, the transfer of
data occurs automatically without adjustment to the [...]

3.2.4.1 A Fully Defined Send-Receive Compatibility Check

Even if a server and client have fixed on a message format, or in the demultiplexed server case, a series of message
id paired formats, the server may not trust the client to do the right thing and send the appropriate message. Actually
verifying the message format has historically been a dodgy affair. Changes to the interface or missed subtleties often left
holes in the check. Further, the more complete the check, the more costly. The architecture of the message passing
library 220 all but eradicates these difficulties. All the information required to describe the data types found in a message
buffer can be found in a message control structure. Further, the portion of a message control structure associated with
the definition of incoming parameters contains no other information. This makes it possible to do binary comparisons of
server stored message control templates with the incoming client message control structure. The distillation of message
buffer information is such that the average parameter is fully described in 8 bytes. Thus the layout of the message buffer
for a 4 parameter interface might be checked by a byte to byte comparison of 32 bytes. The fact that other portions of
the interface, like those associated with transmission control protocol, are described elsewhere means that there will not
be an unnecessary restriction on [...]
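
The binary comparison described above can be sketched in a few lines. The 8-byte descriptor size comes from the text; the split of each descriptor into a 32-bit type word and a 32-bit flags word, and all function names, are assumptions made for the example.

```python
import struct

DESCRIPTOR_SIZE = 8  # per the text: the average parameter is described in 8 bytes


def pack_descriptor(type_code: int, flags: int) -> bytes:
    """Pack one parameter descriptor as two 32-bit words (assumed layout)."""
    return struct.pack("<II", type_code, flags)


def formats_match(template: bytes, incoming: bytes) -> bool:
    """Byte-for-byte comparison of a stored template with a client structure."""
    return template == incoming


# A 4-parameter interface yields a 32-byte structure to compare.
template = b"".join(pack_descriptor(t, 0) for t in (1, 2, 2, 3))
```

Since the structure holds nothing but parameter descriptions, a single equality test over the packed bytes is enough to accept or reject an incoming format.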

The RPC system should be noted here because the message control structure describes buffers and buffer disposition
for both the request and reply. It is very reasonable that a server would support clients that chose different
local buffer disposition options. As an example let us consider 2 clients which both want to interact with a common
server. They both want to send a by-reference field to the server. One wants the buffer removed after the send, the other
wishes to retain it. It would be awkward if the server were to reject one of these two clients just because neither of them
was trusted. The bits of the parameter disposition have been set so that this case can be handled. There is a field of
bits associated with [...] disposition (bits 23-18 of the flags word). By applying a mask to these bits in the
template and the client derived message control structure before the binary check, the server can service both clients
in non-trusted mode.
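
The masked comparison can be sketched as follows. The bit range 23-18 is taken from the text; the 32-bit width of the flags word, the function name, and the sample flag values are assumptions.

```python
# Bits 23-18 of the flags word hold the disposition field named in the text.
DISPOSITION_MASK = 0b111111 << 18          # equals 0x00FC0000
WORD_MASK = 0xFFFFFFFF                     # assumed 32-bit flags word


def masked_flags_equal(template_flags: int, client_flags: int) -> bool:
    """Compare two flags words while ignoring the disposition bits."""
    keep = ~DISPOSITION_MASK & WORD_MASK   # every bit except 23-18
    return (template_flags & keep) == (client_flags & keep)
```

Two clients that differ only in whether the buffer is removed after the send would then both pass the check against one stored template.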

The example brings out one other important point: the check of send/receive compatibility is not only optional,
it is user level. Though the user level library support will include the binary byte by byte check and the client option
mask override for RPC [...] control structures as callable macros, the server is free to fashion any sort of partial
check it sees fit. For example, allowing clients which send a buffer as temporary as well as those which send it as
permanent with the dealloc [...]

3.2.4.2 Control Information Registration 500:

The distillation of message control information and the existence of simple messages which do not require control
structures, the flexibility of the send side check and the option to leave it out, all have significant functional and
performance implications. However, there is one more opportunity for performance optimization which gives non-trusted clients
almost equal performance to the non-checking trusted case. Further, it speeds up both trusted and non-trusted by
avoiding copying the message control structure into message passing library 220 space on a call by call basis, even on
complex messages. The method involves message control structure registration 500.

A server wishing to participate in registration makes a registration call for the message control structures associated
with the server's set of interfaces. The registration call parameters are the message control structure, the associated
port, and a placeholder for the returned registration id. The message control structure becomes registered with the port
for the duration of the life of that port. In this way senders who acquire that registration id will be guaranteed that it is
valid for the life of the port. A client wishing to send messages via the registration [...] contacts the server with a
simple call, sending the [...] structure, possibly containing a message id, and asking for the associated
registration number. The server is free to run what checks it likes, but in practice absolute compatibility is required.
Should the server detect, for instance, a difference in client local buffer disposition and pass back the registration id
anyway, the client would be [...]
which does not match exactly or register an additional [...] structure for that particular message id. The
server would [...]
registration number and [...] on. The server should also keep a copy of the client message control
structure on hand to check against future registration requests. If a client is refused a registration number, it is still free
to attempt non-registered transfer.

The registration of message control structures for servers which persist over long periods is certainly indicated for
both trusted and non-trusted [...] It will be most significant in the non-trusted case, however, since it
removes the need to copy the message control structure to the server and do the call by call check for format compatibility.
A registered server will work with both registered and non-registered senders. Therefore, if a sender is only going to
interact with a receiver once or twice it may not be deemed worthwhile to do an extra call to retrieve the message control
structure registration id.
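
The registration flow can be modelled with a small sketch. The class and method names are hypothetical, and the exact-match policy is folded into the port object for brevity; in the text it is the server that decides which client structures are compatible.

```python
class Port:
    """Toy model of a port holding registered message control structures."""

    def __init__(self):
        self._registered = []   # list index doubles as the registration id

    def register(self, control_structure: bytes) -> int:
        """Server side: register a structure for the life of the port."""
        self._registered.append(bytes(control_structure))
        return len(self._registered) - 1

    def request_id(self, client_structure: bytes):
        """Client asks for an id; in this sketch only an exact match succeeds."""
        for reg_id, stored in enumerate(self._registered):
            if stored == client_structure:
                return reg_id
        return None             # refused: the client may still send unregistered

    def lookup(self, reg_id: int) -> bytes:
        """Library side: fetch the local copy instead of copying it per call."""
        return self._registered[reg_id]
```

Once a client holds an id, each send carries only that number, which is the saving the text attributes to registration.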

Fig. 24 shows a diagrammatic depiction of a message registration and use by a sender. When the client attempts
a send with the newly acquired registration number, the message passing library 220 checks a port associated queue
for the proper message control structure. The message control structure is local to the message passing library 220 and
thus a copying of the control structure is avoided. Further, on RPCs, it is necessary to keep the message control structure
handy while the client is awaiting reply; one control structure is kept for each ongoing transfer. In the registration case,
only a registration number need be stored. The message passing library 220 is set up such that the client must request
the registration information of the server for two important reasons. First, it reduces the code which must be maintained
in the message passing library 220. Second, the server maintains full flexibility in determining who matches registered
message formats and who does not. Use of the overwrite option and reply overwrite can make a wide variety of incoming
message formats compatible. It is up to the individual server to sort through this and support the set of formats it sees fit.

3.2.4.3 The Overwrite Buffer

Receivers who wish to influence the placement of permanent data and receipt of capabilities in their space upon
the acquisition of a message must supply an overwrite buffer. [...] data influenced by an overwrite buffer are
1: permanent data (note: the permanent by-reference [...] includes the server dealloc cases) and 2: capabilities.
It is possible via the overwrite buffer [...] a mapped area of memory. Or have an
incoming [...]

Overwrite buffers [...] control structure. As such, of course, they only affect
the local caller. Overwrite has additional functionality. The overwrite buffer considers capabilities and by-reference
permanent regions to be enumerated or indexed as encountered. As the incoming message is scanned, the first
encountered capability or permanent by-reference region is influenced by the first [...] in the receive overwrite buffer, the
second encountered, by the second descriptor, and so on. Intervening parameters of other types have no effect. The
only exception to this is when the receiver chooses the gather option. In this case data from multiple by-reference regions
or that associated with capabilities is concatenated together and written into memory at the location [...]
by the overwrite descriptor. Any number of descriptors [...] in this way and there is an option to [...]
the number strict or "upto". In the strict case, exactly the stated number [...] found to fill the gather
descriptor area or an error is returned. In the "upto" case, if the number of regions specified in the descriptor is larger
than the available number of regions in the incoming [...], the message proceeds anyway. [...] in the
overwrite region which account for [...] regions than are found in the message are ignored. Likewise, if the overwrite descriptors
account for fewer permanent by-reference and capability parameters than occur in the message, the parameters beyond
those enumerated by the overwrite structure behave as if the overwrite option had not been exercised.
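
The encounter-order pairing of overwrite descriptors with incoming parameters can be sketched as below. The kind strings, the function name, and the use of `None` to mean "no overwrite applies" are assumptions made for the example; the gather option is not modelled.

```python
# Parameter kinds that an overwrite buffer influences, per the text.
ELIGIBLE = {"capability", "permanent_by_reference"}


def apply_overwrite(params, descriptors):
    """Pair each eligible parameter, in encounter order, with the next descriptor.

    params: list of kind strings for the incoming message parameters.
    descriptors: list of placement hints from the overwrite buffer.
    Returns a list the same length as params; None means no overwrite applies.
    """
    placements = []
    remaining = iter(descriptors)
    for kind in params:
        if kind in ELIGIBLE:
            # Once descriptors run out, later parameters behave as if the
            # overwrite option had not been exercised.
            placements.append(next(remaining, None))
        else:
            placements.append(None)   # intervening types have no effect
    return placements
```

Descriptors left over after the scan are simply never consumed, which corresponds to the case where extra descriptors are ignored.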

Use of gather often necessitates the request of send message control information by the server so that the actual
size and [...] regions and capabilities will be known. The control structure must also be consulted to
find the [...]

In the case of RPC it is necessary for the server to construct a message buffer for the reply which is in the format the
client is expecting. In two way IPC 122 of course this is always true, as there is no protocol [...] between the format of
the send and the receive. Fig. 25 is an example [...]

3.2.4.3 Reply Overwrite Control Information: 






When using the overwrite option, [...]
taken to ensure [...] the post, or reply interface might well have been set up to
deallocate a by-reference region using the server-dealloc option. If the server has re-directed by-reference data to a
region that it wishes to persist past reply delivery, it must [...] structure. Upon detection
of the server side reply side control structure, the message passing library 220 scans it for server side buffer disposition
[...]
upon which the server-dealloc option was set for buffers which were only passing information to the server. This, however,
was an insufficient answer for the buffers being used to send data both ways or just to the client.

Section 4: Message [...]

The capability engine 300 is defined to create [...] message passing service. It has done so by
formalizing all transfers as either direct data or capabilities. [...] passing system such as MACH,
such ports can be used to pass access to any transformation. The message passing library 220 carries with it the function
of the capability engine [...]
must be done explicitly in the capability engine 300 paradigm and creating a language to express their transfer without
creating formal capabilities. If the two endpoints of a transfer both use the message passing library 220 then a mapped
area of a sender's space can be described in a message and the place to write it in or map it in can be described for
the receiver.

Asynchronous messages still require the creation of capabilities because the data sent must be captured before
the sender returns and the receiver is either not yet known or not yet ready to receive the data. Synchronous interfaces
on the other hand need never create intermediate capabilities for the by-reference data types, because the sender must
pause for a pending reply anyway; the point of synchronization for the client is not the return from the send but the return
from the reply. This allows the message passing library 220 to pause the client before message creation and to proceed
only when a receiver is available and the transfer can proceed from task space to task space without an intermediate
message. It is clear then that the message passing library 220 must also formalize the type of transfer (asynchronous
vs synchronous).

It can be further recognized that there are really two kinds of synchronous transfer, one in which the semantic
meaning of the reply is directly tied to the send and one in which the two are disjoint. The message passing library 220
was designed to support a wide variety of message passing models. Its base function for data transferal is the same as
that for the capability engine 300. But it also includes tools to facilitate non-layered support of popular forms of Remote
Procedure Call and Interprocess Communication [...]

4.1 Remote Procedure Call 

Remote Procedure Call or RPC can really be distinguished from the message passing library 220 function, the larger
cloth from which it is cut, by a set of restrictions:

1: A send does not return from the call to the message passing library 220 until the message has been delivered
to its target.

2: The data in the send and receive portion of an RPC is semantically linked to the extent that a message of the
same format is sent and received.

a) the incoming and outgoing message share the same format

3: The RPC system must be capable of acting as a proxy. It must be able to simulate the call of a local procedure
by acting in place of that procedure. It must transfer the associated data to the task space where the remote
procedure lies, await the remote procedure's processing, return the results to the caller's space and finally make all the
incidental changes like buffer removal or creation for the class of procedure calls supported.






The third point may not seem like a restriction; indeed the proxy notion accounts for the separation of transmission
information as a substructure of the message. In a sense though it is. In the proxy case, the parameters in a message
buffer are exactly those sent by the initial caller. In many languages, this creates some restrictions. In C for instance, all
the direct variables are of fixed length and data cannot be passed back in a direct variable. This allows the RPC
system to make performance enhancements. Opportunities for performance enhancement based on specific use is in
fact the reason for formalizing the support of RPC; the restrictions which distinguish it allow for additional optimizations.
Restriction 2 is actually a guarantee by the message passing system that a client initiating a call [...]
in starting that call and activating the associated server in a non-restartable way only to find out that a loosely [...]
reply does not match the client's expectations. The semantic link between request and reply has implications [...] the
message control structure and the registration service. Because the request message and the reply message [...]
the same format, it is most natural to have the message control structure contain information for both the send and
receive, coalescing the control information rather than sending two structures. Coalesced or not, the fact that the client must
declare the entire operation upon RPC [...] has an impact

[...] overwrite option [...] incoming [...]
structure check. Because the client's send [...] information. Restriction [...]
some additional [...] and register possibly more than one [...]
format [...] out of the asymmetric nature of the client/server relationship, the server [...]
control structure. If there are two clients which send exactly the same format message but wish to receive the reply data
differently, the server must register two message control structures to support them both.

The implications of restriction 1 have been considered in detail in section 3. The adoption of synchronous message
passing not only leads to lower CPU overhead in data and resource transfer, it also decreases kernel level resource
utilization and makes it more predictable.
Fig. 26 is a diagram of the RPC transfer. The message control structure is kept in the message passing library 220
while the server is active in anticipation of the reply. If the message was complex but was accompanied by a large
amount of direct data, the server can avoid sending this data back on the reply by sending an override message control
structure with zero size direct data fields. The message passing library 220 will use the override message control structure
to find the by-reference, capability, and other port fields in the message buffer sent from the server and will fill in client
buffers, or update the client's double indirect pointers as appropriate. The client message buffer is, of course, not written
back to the client.






4.1.1 Scheduling Alternatives and Threading Models: 

There are two major models of RPC support with respect to scheduling: the active and passive server models. In
the active case the scheduling information associated with the client's request is that of the server thread. In the passive,
it is that of the client. In the active model, the server can be observed to directly commit a thread to the receipt of a
message on the target port. The client then sends a message to this port and blocks waiting for the reply. The server
thread returns to non-supervisor mode with the message and proceeds to process it, returning with a reply when
processing is complete. In the passive model the server, as owner of a port, prepares a thread body (it prepares state and a
set of resources for an incoming kernel level thread). The client does not so much send a message as enter the target
server's space with the kind of restrictions associated with a traditional kernel level service call, i.e., start execution at
a target mandated point, process incoming parameters along previously defined lines.

In the case of RPC the assurance that the client will block while the server is working on its behalf is very helpful in
supporting elements of the passive model without having to expose an actual passive or thread migrating model to the
user level. First, all kernel level temporary resources associated with a client thread at kernel level may be borrowed by
the server. The thread stack and other temporary zone space are good examples. The client prepares a message for
transfer. The server is then allowed to borrow the buffers which hold the results of that preparation. In this way, there is
no distinguishable performance difference between the two models. Indeed, whether or not a transfer is recognizable
as thread migration has more to do with the naming of kernel level resources than with the actual implementation at
kernel level. As an example, in a recent paper, calls to threads were transformed to associate thread level markers such
as the thread port with the thread body instead of the thread proper. In this way, the portion of the thread associated
with migration, the thread shuttle, will become effectively anonymous. The effect could have been achieved another way.
Staying within the paradigm of active threads one can enumerate the characteristics of thread migration as separate
options. The most important is, of course, scheduling. If the server thread in the active case inherits the client's scheduling
characteristics, and the kernel elements of the thread are anonymous, there is near performance and functional
equivalence between the [...]

In the active model an actual runnable thread is created on the server side. This may or may not be used for other
activities; in either case it is eventually put to sleep awaiting a receive. If the port is a passive RPC port, kernel level
resources, even the schedulable entity, may be discarded. (A port level scheduling information template for lost state
would have to be made available for aborts.) When a client arrives at the port with a message, the client loans its kernel
temporary resources and its schedulable entity, effectively its shuttle, to the server thread, now effectively a thread body.
The client entity [...]

There are some advantages in exposing a [...] model to the user level. Certainly, easier thread body
resource management is one of them. If an active model user wishes to move the "thread_body" from one waiting port
to another, it must exercise an abort. Exposing [...] actual resource queues for thread bodies to the application level
would allow the user to move [...] with simple pointer manipulation. Further, creation and destruction of thread
bodies is less expensive in the exposed case. This might give a small advantage to the extremely dynamic sender case.
Depending on the exact nature [...] interface it is also possible in the exposed case to allow for resource pooling
between receive ports, letting separate ports draw upon common thread body resources. The method would be
somewhat more flexible than port sets in that thread resource could be subsetted, but most of the equivalent pooling capability
could be supported [...] of the [...]
model (when that model uses anonymous kernel resources) with an active interface. The world of asynchronous
messages [...] to the active and passive
models when it comes to one-way sends.

User level [...] the availability of state information on the depth of calls and path of migrating threads
would, of course, force the exposure of a migrating thread model. Shuttles would no longer be anonymous and would
carry with them information regarding the recursive depth and path of ongoing calls. Only such a direct requirement for
message passing library 220 supported state is expected to force the need for migrating thread exposure, however. Even
reasonable abort semantics [...]

4.1.2 Client/Server Juxtaposition:

Client/Server juxtaposition is characterized by the synchronization of the client send and server receive. In the case
of RPC, if the server arrives at the receive port before there are any messages to receive, it blocks. If the client arrives
before the receiver it blocks until the receiver arrives. This in effect guarantees simultaneous access to both the client
and server space for the purpose of message transferral. Though client/server juxtaposition can be achieved in some
circumstances in asynchronous communications it cannot always be guaranteed as it is in the case of RPC. If an
asynchronous send is attempted on a port upon which there is not a waiting receiver, the message passing library 220 must
produce a message and allow the sender to continue.






In synchronous transactions the ability to guarantee that a sender cannot continue until the receiver has obtained
the data snapshot means that actual messages need never be created. This minimizes expensive capability translations,
message creation, message parsing and free operations. When a message is created, all by-reference types essentially
revert to capabilities. The memory regions associated with the by-reference area must be copied (in one form or another;
copy on write, copy maps and such are beyond the scope of this paper) and pointed to out of the message. This effectively
creates a capability. The capability is anonymous, which saves the target space mapping costs, but it is still quite
expensive.

Even in the case where an explicit message must be created, by-reference types are superior in performance to
capabilities because they still allow the receiver to map or write incoming data without doing a specific call. Further,
some small by-reference fields might avoid [...] by temporary conversion to direct data. This seems
especially likely for the server temporary examples.

Assuring client/server synchronization also reduces the need for kernel level resources and leaves the remaining
resource needs more predictable. In an asynchronous world, system lockup through resource over-utilization can occur
when too many messages are left waiting in queues. An example can be easily constructed. Thread A sends a message
to thread B. B, however, is busy processing an earlier request (possibly from A). To process this request, B must post
messages to several other tasks. Each of these messages requires a large amount of space. Each of the subsequent
tasks must, in turn, post a message. The system designer made sure that there would be enough resource to run the
request, but failed to take into account the storage that additional waiting requests on thread B would use. The
system halts, or fails, unable to create the messages that the tertiary threads need created in order to service thread B.
This particular problem can be overcome, and indeed memory limits and such can be placed on ports in an effort to
manage the problem. Nevertheless, it is clear that asynchronous message creation causes a resource utilization problem
which requires application level attention to avoid resource exhaustion. Universal user level management can become
impossible in a complex system with multiple personalities and varied applications. Solutions could be constructed which
required multilevel operations to reserve all the necessary storage before beginning, but this sort of transaction processing
has problems of its own and is, of course, inherently synchronous in nature. Client/server synchronization can reduce
kernel resource requirements to some small number of bytes per thread in the system. Management of application
specific resource, of course, remains a potentially difficult problem, but kernel level resource management for RPC might
reasonably consist of nothing more than controlling the number of threads the system can have in existence at any one
time.

4.1.3 RPC Specific Message Control Information

The message control structure associated with an RPC call refers to both the request and reply portion of the
message. Besides the fact that it is the very essence of RPC to semantically link the send [...]
turns out to be convenient with respect [...]

In the message library version of RPC, the message buffer format and in most cases buffer content does not change
on reply. The message buffer represents the parameters sent by the original caller of the RPC. For many languages
(including C) the parameters are not directly alterable by the caller. This means that in our implementation of RPC, it is
not necessary to copy the message buffer back to the client. Absolutely requiring the [...]
makes it possible to always have one message control structure description of the message buffer instead of two. Having
one message control structure describe both send and receive represents another [...] control
information. Only one descriptor is necessary per parameter instead of two. The information associated with the receive and
send side buffer disposition in the case of by-reference variables is kept separate, making decomposition of send and
receive side specifics convenient.

There are two drawbacks [illegible] reply information. The first is the issue of verifying the
compatibility of a client interface when the server is using override options to alter the disposition of local buffers
associated with the call. In this case, the remote buffer disposition bits associated with one or more of the parameters of the
incoming message are no longer valid. This problem has been gotten around by collecting all of the bits associated with
server buffer disposition into a field. The server may check the incoming message control structure with the same
byte by byte comparison except for the addition of a masking operation before the comparison of parameter flags fields.
The server is in full control of the compatibility check; based on the type and scope of the override, the mask may be
used on all or some of the parameters of the incoming message. The second drawback is centered around the
server_dealloc option. Special care will have to be taken when it comes to server_dealloc; the server may be compelled
to check for this option and, where it occurs, send an override back on the reply. This is sub-optimal in the sense that if
a client persists in sending messages with a server_dealloc on a parameter and the server persists in doing overrides
in which the server_dealloc must be overridden, the server must continually send a reply message control structure
and the message passing library 220 must on a call by call basis consult it. In the worst scenario, the server would check
the incoming message control structure every time, doing a special check for the dealloc. This is not a large disadvantage
over a hypothetical non-coalesced notion since in that case the server would have to send a message control structure
with every reply. But it does require an extra check at user level on the part of the server and a cross comparison at the
message passing library 220 level. This, of course, can be avoided by having the client send a message control structure
that does not contain server_dealloc, or through registration. The server can choose to register a message control
structure which does not include the server_dealloc option and return the registration id for this to the client.
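The masked compatibility check described above can be illustrated with a minimal sketch in C. The descriptor layout, the field names, the SERVER_DEALLOC bit value and both function names are assumptions made for illustration only; they are not the actual format used by the message passing library 220. The sketch compares field by field rather than over raw bytes, but the idea is the same: server side disposition bits are collected in one field and masked before comparison.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-parameter descriptor; layout and names are assumptions. */
typedef struct {
    uint8_t type;          /* parameter data type */
    uint8_t client_flags;  /* client (sender) side buffer disposition */
    uint8_t server_flags;  /* all server side disposition bits, in one field */
    uint8_t reserved;
} param_desc;

#define SERVER_DEALLOC 0x01u  /* assumed bit position for server_dealloc */

/* Returns 1 when the incoming structure matches the template.
 * Bits set in server_mask are ignored in the server_flags field;
 * every other field must match exactly. */
int mcs_compatible(const param_desc *tmpl, const param_desc *in,
                   size_t nparams, uint8_t server_mask)
{
    assert(tmpl != NULL && in != NULL);
    for (size_t i = 0; i < nparams; i++) {
        if (tmpl[i].type != in[i].type)
            return 0;
        if (tmpl[i].client_flags != in[i].client_flags)
            return 0;
        if ((tmpl[i].server_flags & (uint8_t)~server_mask) !=
            (in[i].server_flags & (uint8_t)~server_mask))
            return 0;
    }
    return 1;
}

/* Returns 1 if any parameter carries server_dealloc, i.e. the case in
 * which the server would have to send an override on the reply. */
int mcs_has_server_dealloc(const param_desc *in, size_t nparams)
{
    assert(in != NULL);
    for (size_t i = 0; i < nparams; i++)
        if (in[i].server_flags & SERVER_DEALLOC)
            return 1;
    return 0;
}
```

A server that overrides a disposition would pass the corresponding bits as server_mask; a server with no overrides passes 0 and gets a strict comparison.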

4.1.4 The Subclass of Supported Procedure Calls:

To be sure, when emulating procedure calls it is simply not possible to support the entire range of local procedure
call behavior. Certain things are ruled out right away. There are to be no side effects on global variables not associated
with the parameters of the call but accessible to the caller and the callee via direct access. No inner scope tricks, either.
A called procedure cannot act on a variable which has been declared as a local variable of the calling procedure or one
of its ancestors unless that variable [illegible]

Beyond these obvious examples of side effects, however, lie a large set of perfectly valid calls which we do not
support. The largest of these is multiple indirection of degree > 2. Though it could be supported, it was deemed not
worth the trouble to support variables whose degree of indirection was above 2. Double indirection got the nod because
it allows the callee to change pointer values [illegible]

In spite of the restrictions, the subset of procedure calls supported by RPC is large. It was a design goal to allow
both uninterpreted message buffers built on the original caller's parameters and the transparent use of RPC. Clients do
not have to send a destination port as the first parameter on their call; the message passing library 220 has the ability
to send the destination port in the transmission data section. Subclassing could then be carried out through library swap
or change in library resolution path. Further, clients are allowed to return data and pointers on their function calls. The
message passing library 220 supports this through an optional separate status return for transmission status. Most
importantly, buffer disposition classes have been set up to support a wide range of actions the called function may
take. The client can fully expect the called procedure to remove a buffer after looking at the data, or allocate a buffer in
which to return data. The range of supported semantics is determined by the buffer subclasses defined in Section 2.
Beyond direct support, the client side proxy routine is capable of supporting local semantics, i.e., if the client was
expecting the server to use a particular heap source when allocating a buffer, the proxy might allocate such a buffer using
local calls and change the RPC call to reflect a write into this buffer. This would, of course, cause the proxy to rewrite
the message buffer and would have some effect on performance.

Support of such a wide class of procedure calls was designed to ease the path towards co-residency. (Allowing
callers and callees to be written to one paradigm and yet execute both remotely and locally without performance loss.)
The local case was the most performance sensitive. In order to get the best possible performance here, the paradigm
had to look as much as possible like a local procedure call. This was achieved; the calls can indeed be, and are in some
cases (those in which the client does not send a destination port), local calls. The enumerated list of properties below
along with the restrictions mentioned in the first two paragraphs of this section characterize the supported set of
procedure calls. Procedures not venturing beyond these are supported, even if they were never written to work in a
message passing environment.

1. Function return values may be full [illegible]

2. By-referenced [illegible]

3. The callee [illegible]

4. The callee may create a buffer and supply it to the client through the setting of a double indirect pointer.

5. The callee is capable of writing into a buffer supplied by a client and if that buffer is not big enough, either

a. returning an error

b. truncating the data

c. getting a new buffer and pointing a double indirect pointer to it

6. The callee may push the data associated with different calling parameters into a single pooling buffer.

7. The callee may push data associated with different calling parameters into multiple pooling buffers.
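Item 5 of the list can be sketched in C as follows. This is only an illustration of the three outcomes at local procedure call level; the function name callee_write, the short_policy enumeration and the use of malloc as the buffer source are assumptions, not part of the described system.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* What the callee does when the client's buffer is too small (item 5 a, b, c). */
typedef enum { ON_SHORT_ERROR, ON_SHORT_TRUNCATE, ON_SHORT_REALLOC } short_policy;

/* Writes len bytes of src into *buf, whose capacity is *cap.
 * Returns the number of bytes written, or -1 on error.
 * With ON_SHORT_REALLOC the callee obtains a new buffer and points the
 * caller's pointer at it through the double indirect parameter; the
 * original buffer is left untouched and remains owned by the caller. */
long callee_write(char **buf, size_t *cap, const char *src, size_t len,
                  short_policy policy)
{
    assert(buf != NULL && cap != NULL && src != NULL);
    if (len > *cap) {
        switch (policy) {
        case ON_SHORT_ERROR:
            return -1;
        case ON_SHORT_TRUNCATE:
            len = *cap;
            break;
        case ON_SHORT_REALLOC: {
            char *fresh = malloc(len);
            if (fresh == NULL)
                return -1;
            *buf = fresh;   /* visible to the caller via the double pointer */
            *cap = len;
            break;
        }
        }
    }
    memcpy(*buf, src, len);
    return (long)len;
}
```

A single level of indirection would be enough for cases a and b; case c is what makes the second level necessary, since the callee has to change the pointer value held by the caller.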






There are some restrictions on ports when they are sent as arrays. The pointer to the array can of course not be
more than double indirect and all the ports in the array must have the same disposition. Since ports are a data type
special to message passing, these restrictions might be more properly looked upon as restrictions on the data type.

4.1.5 RPC Specific Use of the Transmission Control Information Component:


RPC uses the transmission control information subcomponent in the same way as other message passing models.
It can alter defaults for NDR state; it can pass information back and forth between proxies via the trailer. RPC,
however, has some additional needs which must be met in order to support message passing transparency for its clients.

The two major options are STATUS and DESTINATION PORT. The RPC subsystem supports the return of pointers and
data on remote function calls. The default behavior for the message passing library 220 is to combine procedure return
status and transmission status much as the CMU mach message passing service did. In order to separate the function
return code information, the client side proxy must request the STATUS return option.

The function return status is then placed in a field in the optional header area. The transmission status is returned
in the normal way. This preserves the message buffer, allowing it to appear just as the original caller's parameters did.
The transmission section destination port override allows the proxy to determine the destination of the message without
changing the message buffer. Again this is meant to support the dual notion of the fastest possible user level interface
with support for transparent procedure call emulation. It seems less likely that 2 way IPC 122 for instance will want to
hide the destination port notion from the caller, but the option remains available to all proxies. As with non-RPC uses,
the trailer will prove useful for its security token, sequence number, scheduling information, and possible routing
information.
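The difference between the default combined status and the STATUS option can be sketched as follows. The structure reply_header, the flag OPT_STATUS and the function make_reply are hypothetical names, and the rule used for combining the two statuses in the default case (a transmission failure takes precedence) is an assumption; the text only says that the two are combined.

```c
#include <assert.h>
#include <stddef.h>

#define OPT_STATUS 0x1u   /* assumed flag: proxy requested separate STATUS */

/* Hypothetical view of what the client side proxy gets back. */
typedef struct {
    int transmission_status;   /* returned in the normal way */
    int has_function_status;   /* 1 when the optional header field is present */
    int function_status;       /* function return code, optional header area */
} reply_header;

/* Fills *out for a reply carrying the given transmission and function status. */
void make_reply(reply_header *out, unsigned options,
                int xmit_status, int func_status)
{
    assert(out != NULL);
    out->has_function_status = 0;
    out->function_status = 0;
    if (options & OPT_STATUS) {
        /* Separate return: the message buffer itself is left untouched. */
        out->transmission_status = xmit_status;
        out->has_function_status = 1;
        out->function_status = func_status;
    } else {
        /* Default: one combined status (precedence rule is an assumption). */
        out->transmission_status = (xmit_status != 0) ? xmit_status : func_status;
    }
}
```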

4.1.6 Priority Based Queuing, Making the Client Queue Pluggable:






The capability engine 300 has generic interfaces for the queuing of incoming messages. The routine
actually called is determined by a field in the port structure of the target port. The capability engine 300 consults this
field and calls the procedure pointed to by it. The capability engine 300 also has the interfaces that are called to alter
this field. Once this field is set up, the proper queuing is the order of the day. The queuing code can check the schedulable
entity associated with a blocked thread or a field in the message and queue or dequeue a thread/message accordingly. This
is the basic idea behind the message passing library 220 and capability engine 300 support of multiple queuing methods.
The capability sender queuing call is made (either through svc or directly) with a kernel object as parameter. The first
portion of both the message and thread structure are kernel objects. The queuing procedure itself determines the type
of the kernel object via a type field in the self defining data structure (the kernel object) and proceeds accordingly.
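A minimal sketch of such a pluggable queue is given below. All type and function names (kernel_object, port, enqueue_fifo, enqueue_priority, port_set_queuing, port_enqueue) are hypothetical, as are the priority fields; the sketch only shows a routine selected through a field of the port structure and a queuing procedure that inspects the type field of a self defining kernel object.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { KOBJ_MESSAGE, KOBJ_THREAD } kobj_type;

/* Common first portion of both the message and the thread structure. */
typedef struct kernel_object {
    kobj_type type;
    struct kernel_object *next;
} kernel_object;

typedef struct { kernel_object ko; int priority_field; } message;
typedef struct { kernel_object ko; int sched_priority; } thread;

struct port;
typedef void (*enqueue_fn)(struct port *, kernel_object *);

typedef struct port {
    enqueue_fn enqueue;     /* the field consulted on every queuing call */
    kernel_object *head;
} port;

/* The queuing code decides how to read the object from its type field. */
static int kobj_priority(const kernel_object *ko)
{
    if (ko->type == KOBJ_THREAD)
        return ((const thread *)ko)->sched_priority;
    return ((const message *)ko)->priority_field;
}

/* Plain first-in, first-out queuing. */
void enqueue_fifo(port *p, kernel_object *ko)
{
    kernel_object **pp = &p->head;
    while (*pp != NULL)
        pp = &(*pp)->next;
    ko->next = NULL;
    *pp = ko;
}

/* Priority queuing: higher priority first, FIFO among equal priorities. */
void enqueue_priority(port *p, kernel_object *ko)
{
    kernel_object **pp = &p->head;
    while (*pp != NULL && kobj_priority(*pp) >= kobj_priority(ko))
        pp = &(*pp)->next;
    ko->next = *pp;
    *pp = ko;
}

/* Interface that alters the field. */
void port_set_queuing(port *p, enqueue_fn fn) { p->enqueue = fn; }

/* Generic interface: consult the field and call through it. */
void port_enqueue(port *p, kernel_object *ko)
{
    assert(p != NULL && p->enqueue != NULL);
    p->enqueue(p, ko);
}
```

Because the caller only ever uses port_enqueue, a different discipline can be plugged in per port without touching the calling code.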

RPC, of course, [illegible] message specific [illegible]
unused. For details on the expected [illegible] information in messages [illegible] dual queuing of
messages and blocked threads. The queuing function of the capability engine 300 is not expected to be called
directly in the case of RPC; rather, it is expected that SVC calls which encounter a shortage of server resource (either
active thread, or passive thread body notion) will trigger the capability engine 300 to call the queuing mechanism.
In Fig. 27, the Port is shown inside the capability engine 300 because ports are only accessible through capability
calls.

4.1.7 Support For Message Server Spaces, Demultiplexing on Message ID: 

It is often the case that a series of functions have enough features in common or contribute to a single purpose in
such a way that there is advantage in describing them as a set. Further, if these functions share the same resource and
information base it is important that the members of the set not be physically divided. In an effort to support this and to
economize on ports, port level demultiplexing is carried forward from the CMU mach_msg model. (It is also necessary
for compatibility.)

The message id appears as a field in the header; it determines the interface and format of a message amongst a
series of interfaces associated with a single port. The message id is not a primitive of the message passing library 220
in that the library does not use its value in message handling decisions. It therefore could be relegated to the trailer as
an option. However, it has been the overwhelming experience with CMU's mach_msg that demultiplexing RPCs are
preferred over one port, one method or a more general and interpretive IPC 122 with its subsequent parsing costs.
For a fixed set of interfaces, the message id is indeed an optimization of interpretive IPC 122. It is at once faster
and more powerful. By convention, the message id transmits semantic as well as format information. (Interpretive IPC
122 here is defined to mean the message format is not known ahead of time and the message control structure must
be consulted.)






In the demultiplexing model, the user level server is comprised of a primary procedure which places messages on
and recovers messages from a port. This procedure does some general processing (message too large, restart handling,
buffer handling, etc.) and in turn calls a server side proxy. The proxy called is determined by the message id.
This server side proxy does the discrete function level specific setup and checking.

The general server loop does get involved in discrete message processing in one place. The message control
structures are made available to it through a table indexed by message id. The byte by byte check code is generic; it is
just the data involved which is function specific. Further, alterations to server side options are necessarily server wide
in scope. It is the general server loop that is the most appropriate place for the necessary adjustments to the server
side format check. It is also true that the server side stubs tend to be automatically generated; this makes them a less
convenient target for receive side buffer disposition customization. An outline of the flow of execution in a typical message
receive/send is shown in Fig. 28.

• Primary Server Function receives message.

• Primary Server Function checks status, i.e., message too large, and does appropriate high level handling.

• If registered, primary server checks index table to relate registration id to message id.

• If not registered and client not trusted, Primary Server Function uses message id to get message control structure
template and check against the (obviously requested) sender message control structure.

• Primary Server Function uses message id as offset into table to get the proper proxy function. Primary Server calls
proxy function.

• Proxy function does any necessary transformations on incoming data. These transformations are function/application
specific and outside of the [illegible] automatic proxy generation
tools.

• Proxy function calls the targeted endpoint (the callee).

• Proxy function does [illegible]

• Primary Server Function [illegible] it is not allowed to increase [illegible] of the [illegible] unless another
buffer is used. (There may be [illegible] primary server option includes a [illegible] message control structure
[illegible] its size (rare). Such custom [illegible] product; the [illegible] writer may be [illegible] to server loop
[illegible])

• The Primary Server Function calls the message passing library 220 with a send/rcv. The supplied header is pointing
at the reply structures. (The message buffer which in turn is pointing to [illegible] contain
data to be supplied on the reply.) The header is at the top of the receive buffer. The receive buffer is big enough to
hold any of the incoming message headers and their temporary data 400, or one of the oversize options may be
encountered.
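The dispatch portion of the outline above (format check against a template indexed by message id, then a call through a proxy table) can be sketched in C. The structures, the fixed size MCS_BYTES, the status codes and the example proxy are all hypothetical; registration, oversize handling and the send/rcv call itself are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MCS_BYTES 4   /* assumed fixed size of a message control structure */

typedef struct {
    int msg_id;                    /* message id field from the header */
    int client_trusted;            /* 1 when the format check may be skipped */
    unsigned char mcs[MCS_BYTES];  /* sender message control structure */
    int arg;                       /* stand-in for the message buffer */
    int result;                    /* stand-in for the reply data */
} message;

typedef int (*proxy_fn)(message *);

typedef struct {
    proxy_fn proxy;                         /* server side proxy for this id */
    unsigned char mcs_template[MCS_BYTES];  /* expected message control structure */
} server_entry;

enum { SRV_OK = 0, SRV_BAD_ID = -1, SRV_BAD_FORMAT = -2 };

/* One pass of the primary server function over a received message. */
int server_dispatch(const server_entry *table, size_t n, message *m)
{
    assert(table != NULL && m != NULL);
    if (m->msg_id < 0 || (size_t)m->msg_id >= n ||
        table[m->msg_id].proxy == NULL)
        return SRV_BAD_ID;

    const server_entry *e = &table[m->msg_id];

    /* Generic byte by byte check; only the template data is function specific. */
    if (!m->client_trusted &&
        memcmp(e->mcs_template, m->mcs, MCS_BYTES) != 0)
        return SRV_BAD_FORMAT;

    /* Message id used as offset into the table to reach the proxy. */
    return e->proxy(m);
}

/* Example endpoint and its proxy (both purely illustrative). */
static int endpoint_double(int x) { return 2 * x; }

int proxy_double(message *m)
{
    m->result = endpoint_double(m->arg);
    return SRV_OK;
}
```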

4.1.7.1 Dynamic [illegible]

The support of dynamic linking and co-residency is very [illegible]. It allows the download [illegible]
a target space. Proper implementation will allow a downloaded function to [illegible] a local procedure call and perform
possibly as a local procedure call without any additional overhead, effectively bypassing client proxy, server, and server
proxy routines. When the function call is aware of message passing, it will still be necessary in the local case to have a
proxy inserted between the caller and callee, but the complexity and overhead of this proxy will be greatly reduced when
contrasted with a remote call.

Co-residency also supports the remote setup of servers. To support this, co-residency must go beyond simple
download and link functionality. In the case of a pre-existing server, download and link from an external source could be
[illegible] to alter one or more of the server proxies and their endpoint routines. In order to do this, however, the
[illegible] would need to know the name of the proxy, possibly the name of the endpoint, and have general write permission,
i.e., the task port for the target task. Just to support this functionality with some degree of protection, a complicated
[illegible] user level utilities would have to be created. These utilities would be trusted and a target task would entrust
its task port to them. Other tasks wishing to download function would have to communicate with the download target through
these utilities.

Even if the complicated application level tools were acceptable, the level of functionality really isn't sufficient.
Additional function requires a high level of communication between the target and the task attempting remote download.
The caller cannot start a server, or add a new message id to an existing server, without some method outside of the
defined notion of co-residency.

In order to support these notions in a simple straightforward manner, we need support for a dynamic server model.
A task wishing to make itself available as a dynamic server must create and export a port which makes the series of
server creation, manipulation, and shutdown routines available. A server for servers [illegible] server/server [illegible]
presented by the server library. The default server loop is not just a shared library [illegible]
a threadless instance of a server and returns a handle. This handle is used by subsequent [illegible] change optional
aspects of the server instance, add or delete server threads, associate proxies and by consequence their endpoints,
add or remove receive buffers, or shutdown and clean up the server instance. After using basic co-resident utilities to
download specified code into a target task, the remote caller would send a server_create message to the server/server
port and receive a handle back on the reply. The caller may have supplied a set of [illegible] or may fill in the
proxies through subsequent calls. The caller has an additional call which is not one of the calls exported by the server
package. An extra call is needed to create a thread and then direct that thread to associate itself with the target server
instance. In the passive model, it is possible to simply provide the thread body resources to the receiver, but in the active
model, the server acquires threads via a call by the target thread. There is an advantage to having the routine built in
this way: the target server task is free to adjust post processing or customize thread state or resource [illegible] specific
needs. Because of the notion of server instance, a server persists even if its threads exit the server. In this way,
exceptional conditions can cause a thread to return from its run_server call. The task is then able to customize exceptional
processing. The thread can then be returned to the server loop. If the exception the thread is returned on is a simple
return_server_thread, the thread is free to re-associate itself with the server, run some other unrelated task or
self-terminate.
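The handle based life cycle of a server instance can be sketched as plain bookkeeping in C. Only the name server_create comes from the text; server_set_proxy, server_thread_enter, server_thread_leave and server_shutdown are hypothetical names, no real threads or ports are involved, and the fixed table sizes are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SERVERS 4
#define MAX_PROXIES 8

typedef int (*proxy_fn)(int arg);

typedef struct {
    int in_use;
    int nthreads;                   /* threads currently associated */
    proxy_fn proxies[MAX_PROXIES];  /* indexed by message id */
} server_instance;

static server_instance servers[MAX_SERVERS];

static server_instance *lookup(int h)
{
    if (h < 0 || h >= MAX_SERVERS || !servers[h].in_use)
        return NULL;
    return &servers[h];
}

/* Creates a threadless server instance; returns a handle, or -1 if none is free. */
int server_create(void)
{
    for (int h = 0; h < MAX_SERVERS; h++) {
        if (!servers[h].in_use) {
            servers[h] = (server_instance){ .in_use = 1 };
            return h;
        }
    }
    return -1;
}

/* Associates a proxy (and by consequence its endpoint) with a message id. */
int server_set_proxy(int h, int msg_id, proxy_fn fn)
{
    server_instance *s = lookup(h);
    if (s == NULL || msg_id < 0 || msg_id >= MAX_PROXIES)
        return -1;
    s->proxies[msg_id] = fn;
    return 0;
}

/* Called by a thread to associate itself with the instance; returns the new count. */
int server_thread_enter(int h)
{
    server_instance *s = lookup(h);
    return (s == NULL) ? -1 : ++s->nthreads;
}

/* Called when a thread returns from the server loop; the instance persists. */
int server_thread_leave(int h)
{
    server_instance *s = lookup(h);
    if (s == NULL || s->nthreads == 0)
        return -1;
    return --s->nthreads;
}

/* Shuts down and cleans up the instance; the handle becomes invalid. */
int server_shutdown(int h)
{
    server_instance *s = lookup(h);
    if (s == NULL)
        return -1;
    s->in_use = 0;
    return 0;
}
```

The point of the sketch is that the instance outlives its threads: after the last thread leaves, the handle is still valid and proxies can still be changed until server_shutdown is called.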

4.1.8 Anonymous Reply Support:

In the message passing library 220, the semantic link established between request and reply regarding [illegible]
disposition and message buffer format is separated from the execution path and the resource necessary to carry out
the request and reply. Through a separate set of fields in the message control structure, options on the port, and
support options in the tasks themselves, the resource used to carry out the request and [illegible]

In the simplest, fastest case, there is no need for an explicit reply port. The client is simply blocked waiting for the
completion of the remote procedure call; the server, or at least a thread of the server, is dedicated for the duration of the
call to completing the remote procedure call and returning the [illegible]. The simple case provides the message passing
library 220 with an opportunity to employ the same technique used to optimize by-reference data transfer, in effect
by-passing the explicit port wait that occurs in asynchronous and explicit receives. The message passing library 220
can in this case avoid the expense of contacting the capability engine 300 both to do the wait and to map a send or
send_once right into the server's space. There are cases, however, where for throughput or transmission control reasons
more flexibility is needed.

Because of this flexibility, in some circumstances, an explicit reply port is required on either the server or client side
in order to keep track of the reply target. Though [illegible] reply port in order to
allow for [illegible]
messages, processing [illegible] then re-[illegible] for reply [illegible] on the explicit reply port. An
example of this behavior can be [illegible]

Some application environments [illegible] signaling only at [illegible] calls. Even [illegible]
system is not strictly asynchronous [illegible] to be received within a bounded time. This can
be partially determined by the frequency at which service calls are made by the target code, the target doing null service
calls when [illegible]
these service calls, however, [illegible] the system can tolerate. In this case it must be possible to abort the
target out of a block on send (or request), a block on receive (or reply) and possibly out of server processing via an out
of band abort signal to the application server routine. If the client side proxy is set up to handle it, the send side abort
with signal is straightforward. The client awakes with an abort_notify signal, processes it and, if it wants, restarts the
RPC. If the server is already processing the request, however, the client is waiting on [illegible]
thread_abort_notify signal during this period, the client has to have undertaken the RPC with an explicit reply port. In
this way the message passing library 220 can send an abort_notify message to the client and the client can re-establish
its wait on the reply. If the client did not supply an explicit reply port, the message passing system will pend the abort_notify
state and include it with the reply coming [illegible]






In order to avoid an explicit reply port on the server side, the server must be able to guarantee that the thread
sending back the reply will be the same one that was associated with the request. In this way, the client awaiting a reply
can be registered in a structure associated with the server thread structure. The server may not be able to guarantee
this as it may be subject to a user level threading package and as a result subject to some form of thread multiplexing
at user level. Such multiplexing is often done in an effort to support throughput, real-time, or some form of explicit
serialization.

Seamless support of anonymous reply port optimization requires that a client decision with respect to reply port be
hidden from the server and vice versa. The message passing library 220 achieves this with the algorithm shown in Fig.
29. The algorithm starts at the point that the message passing system has both the client send and the server receive
in juxtaposition.

Case 1, of course, gives the best performance. However, case 3 should perform better than either 2 or 4 because
it is not necessary to create and place a port right in the server's space. Case 3 may perform nominally better than case
4 because the anonymous port is a light weight affair, not requiring the state and setup of the normal port types.

Upon return from the request, the server thread's data structure is checked for an outstanding reply. This will be
present in case 1 and case 3 above. If this is an example of case 3, a second field in the port structure points to the
client reply port. If the client is blocked on a port, it is removed. If it is not blocked, the server is made to wait on the port
as a sender. When the client is available, the reply is delivered, the client thread returns and the server resources or
thread are free to take on another message.

If the server's thread structure does not point at the client, there must be an explicit port in [illegible] port field of
the server message call or an error is returned to the server. The client thread is retrieved from this port if it is there and
the transfer proceeds. If it is not on the port, the server thread blocks awaiting it.
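The reply path of the last two paragraphs can be sketched as a small decision function. The four cases themselves are defined only in Fig. 29, so the sketch does not name them; it follows the prose alone. All structure, field and function names are hypothetical, and the second field is placed on the server thread structure here purely for convenience.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int id; } client;

typedef struct {
    client *blocked_client;     /* client currently waiting on this port, if any */
} port;

typedef struct {
    client *outstanding_reply;  /* client registered with this server thread */
    port *client_reply_port;    /* set when that client supplied an explicit port */
} server_thread;

typedef enum {
    REPLY_DIRECT,        /* delivered straight to the registered client */
    REPLY_VIA_PORT,      /* client found blocked on a port and removed from it */
    REPLY_SERVER_WAITS,  /* client not on the port yet; server waits as a sender */
    REPLY_ERROR          /* no registered client and no explicit reply port */
} reply_action;

/* explicit_port is the reply port field of the server's message call, or NULL. */
reply_action route_reply(server_thread *st, port *explicit_port)
{
    port *p;

    assert(st != NULL);
    if (st->outstanding_reply != NULL) {
        if (st->client_reply_port == NULL) {
            st->outstanding_reply = NULL;
            return REPLY_DIRECT;
        }
        p = st->client_reply_port;
    } else {
        if (explicit_port == NULL)
            return REPLY_ERROR;
        p = explicit_port;
    }

    if (p->blocked_client != NULL) {
        p->blocked_client = NULL;      /* client removed from the port */
        st->outstanding_reply = NULL;
        return REPLY_VIA_PORT;
    }
    return REPLY_SERVER_WAITS;
}
```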

4.1.9 ABORT Support: 

Abort support is a complex issue made more complex by the fact that abort support is not one function but three.
Standard thread_abort will abort an ongoing message or wait without regard to restartability. Its use is limited, therefore,
to drastic situations such as thread termination or at least termination of the stream of execution utilizing the thread.
The second form of abort is thread_abort_safely, which should probably be called thread_abort_checkpoint or
some other better suited moniker. Thread_abort_safely's real purpose in life is to bring the target thread quickly to a
state where a signal can be safely given. The signal is asynchronous and its mechanism must not be detectable to the
synchronous stream of execution. It is, therefore, imperative that Thread_abort_safely be restartable. The third form of
abort, Thread_abort_notify, like thread_abort_safely is used for signaling. Unlike thread_abort_safely,
Thread_abort_notify delivers its signal directly to the execution stream of the target thread via return status on a message
passing call. Its purpose is not to guarantee a quick restartable return from a kernel call so that a user level signal
processing routine can be called and returned from. If the thread is running at user level, it posts a notify state and bides
its time. Thread_abort_notify [illegible] a signal on return, restartably aborted or otherwise, from a message
passing call.

The aims of the three types of abort are different enough that they influence their implementation; therefore they
will be considered separately.

4.1.9.1 Thread_Abort:

There are two types of wait that a thread is capable of exercising, a kernel based one and an external server based
one. In the case where [illegible] of abort is not worried about thread restartability, the only important considerations
[illegible] waking up a waiting thread are the server or kernel resources and state. The thread may be returned to user
level with a thread_aborted declaration at any time so long as the server/kernel are not left in an undefined state or with
an orphaned resource. In the kernel it should be possible to either clean up the resources synchronously or, less
desirably, create a care taker. In fact, in the new, more modular microkernel 120 architecture, there may [illegible] than waits
on message passing send and receive. In any event, the exact disposition of kernel resource recovery is beyond the
scope of a paper on message passing. The server case, however, directly involves the message passing system.

In RPC, the abort function may find a thread either blocked waiting to begin a request or blocked awaiting a reply.
If the thread is blocked on the request, abort is simple and restartable: return the thread with request_aborted status.
The server was not yet aware of the request and no recovery action at all is required. If the thread is awaiting a reply
the situation is much more complicated.

In the case of Thread_abort, an attempt may be made to stop the server as soon as possible rather than letting it
complete a now useless bit of work. The first attempt to abort the server is made via the port. A field of the port structure
points to an abort_notify function. If the server wishes to support early termination of work for an aborted client, it may
choose this method. The message passing library 220 passes a message to the abort notification port containing the
port and the sequence number of the associated message. (The port is necessary because the message may have
been delivered on a port set.) In any case, the state of the port awaiting the reply will be altered such that when the reply
is sent back, the message will be destroyed and the server reply apparatus liberated. If the port is destroyed first, the
server will simply encounter a dead name for the reply port and may act to destroy the reply and continue.

If the message passing library 220 finds that the abort notify field of the receive port has not been filled in, it checks
to see if the server requested anonymous reply port. If it did, the server has guaranteed that there is an unbreakable
link between a specific server thread and the request. In the server anonymous reply case, the message passing library
220 executes a thread_abort_safely on the server thread and sends a signal indicating that the message being processed
is no longer important. The anonymous reply port, if present, is destroyed. If the client sent an explicit reply port then
the state of the reply port is set such that the reply message will be [illegible]
as if the reply was sent.
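The order in which the server is approached when a client is aborted out of a reply wait can be sketched as a decision function. The structures and names below are hypothetical. The treatment of the client's reply port (discard the eventual reply if the port was explicit, destroy the port if it was anonymous) is an interpretation of the partly illegible text above and should be read as an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    ABORT_VIA_NOTIFY_PORT,       /* message sent to the server's abort notify port */
    ABORT_SIGNAL_SERVER_THREAD,  /* thread_abort_safely on the server thread plus a signal */
    ABORT_NO_SERVER_NOTICE       /* server explicit reply, no notify port: server not told */
} server_notice;

typedef struct {
    int has_abort_notify;        /* abort notify field of the receive port filled in */
    int server_anonymous_reply;  /* server requested anonymous reply port */
    int client_explicit_reply;   /* client supplied an explicit reply port */
} reply_wait_state;

typedef struct {
    server_notice notice;
    int discard_reply_on_arrival;      /* reply port state set so the reply is dropped */
    int destroy_anonymous_reply_port;  /* anonymous client reply port torn down */
} abort_plan;

/* Decides what happens when a client blocked awaiting a reply is aborted. */
abort_plan plan_reply_wait_abort(const reply_wait_state *s)
{
    abort_plan p = { ABORT_NO_SERVER_NOTICE, 0, 0 };

    assert(s != NULL);
    if (s->has_abort_notify)
        p.notice = ABORT_VIA_NOTIFY_PORT;        /* first attempt: via the port */
    else if (s->server_anonymous_reply)
        p.notice = ABORT_SIGNAL_SERVER_THREAD;   /* thread is firmly linked to the request */

    if (s->client_explicit_reply)
        p.discard_reply_on_arrival = 1;
    else
        p.destroy_anonymous_reply_port = 1;
    return p;
}
```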

The client will return from its message passing call with a Thread_abort status. This status indicates to the client
that the message was aborted and the associated resources and data lost. For systems which wish to use the less
refined Thread_abort_safely, retry can be achieved if the client checkpoints its data before attempting message passing.
The server state is important only if the server maintains state between invocations. In this case, the designer must
ensure that the server receives notification of client aborts and takes appropriate action.

From a real-time perspective, there is a danger to proper scheduling of resources in the case where the server [illegible]
acquired the scheduling properties of the client. From a scheduling standpoint, this is effectively the server passive
model where the client entity runs in server space. After experience of an abort, the client thread is effectively cloned
with one running temporarily in the server and one running in the client. If the priority of the client is high enough, the
server thread (in the abort/signal scenario) might run to completion before encountering the signal to terminate. [illegible]
server explicit reply case when there is no abort notify port, there is no attempt to notify the server of a client abort.

It is only in the case of the abort notify port that the server, through the system designer, can ensure [illegible]
of the client abort notification. If the active thread on the server abort notify port is given high [illegible] if the passive
scheduling parameter assigned by the message passing library 220 is of high priority, [illegible] will be scheduled before and
may preempt the client message processing. The server may then set user level state to communicate with the client
message processing thread that it must [illegible]

4.1.9.2 Thread_Abort_Safely:

Unlike Thread_abort, the purpose of Thread_abort_safely is not to logically abort an ongoing execution stream; it
is merely to interrupt it. Thread_abort_safely needs to get the target thread to a state where a user level routine can be
run in order to deliver an asynchronous signal. It then must recover in a way which is transparent to the synchronous
execution stream. In CMU mach_msg, thread abort resulted in a return to the mach_msg call with thread_abort[illegible] or
thread_abort[illegible]. These calls were restartable, but a small interpretive loop and an extra procedure call were required
to check for and carry out the restart.

In the message library, there is no user level retry worked into the message passing call. Thread_abort_safely and
Thread_signal work together in such a way that the return stack is set to deliver the exception message and, when the
exception routine returns, a trap back into the kernel occurs. The return from ex[...]
structure and [...] below [...] Request and [...] have to be ripped from their queues, causing [...] its place in the request
queue and requiring expensive calls to the capability engine 300 to talk to port queues.

Thread_abort_safely is undetectable when it aborts a message from the request wait, just as in the Thread_abort
case above. It is also undetectable in the abort from reply wait case. In fact, the client thread is not effectively removed
from either the request or the reply waits. The resources necessary to carry out the request remain on the queue when
a client experiences a Thread_abort_safely. In the case of the active server, the server thread is free to pick up this
request even during an ongoing Thread_abort_safely. In the case of the passive server model, unless otherwise
instructed (see real time considerations below), a shuttle is cloned and the server processes the request. Another way to
look at it is that when Thread_abort_safely is experienced, the RPC is separated from the thread body of the client.
This thread body is then endowed with a shuttle and told to execute an exception message routine. When the exception
message routine returns, it returns to the kernel. The message passing library 220 then removes the shuttle and
re-establishes the client thread connection to the RPC. In the meantime, the entire RPC might have occurred, including the
placement of reply resource in the client space. There is a precise corollary for the active model except that it does not
suffer from the scheduling difficulties.
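The separation of the RPC from the client thread body can be pictured with a small model. The following is a minimal sketch only, not the library's implementation: the names Rpc, RequestQueue, serve_one and thread_abort_safely (as a Python function) are hypothetical, and the shuttle is reduced to a plain function call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rpc:
    """Hypothetical in-flight RPC; it stays queued independently of the client thread body."""
    request: str
    reply: Optional[str] = None
    client_attached: bool = True  # is the client thread body currently linked to the wait?


@dataclass
class RequestQueue:
    pending: List[Rpc] = field(default_factory=list)

    def serve_one(self) -> None:
        # The server may pick up a request even while its client is being signalled.
        rpc = self.pending.pop(0)
        rpc.reply = "done:" + rpc.request


def thread_abort_safely(rpc: Rpc, exception_routine: Callable[[], None]) -> None:
    """Model of the safe abort: detach the thread body, run the signal routine, re-attach."""
    rpc.client_attached = False   # the RPC is separated from the client thread body
    exception_routine()           # the thread body runs the exception message routine
    rpc.client_attached = True    # the connection to the RPC is re-established


queue = RequestQueue()
rpc = Rpc("read")
queue.pending.append(rpc)
# Here the server happens to process the request while the signal is being delivered.
thread_abort_safely(rpc, queue.serve_one)
```

In this model the request is never pulled off the queue by the abort itself, so the whole RPC can complete while the exception routine runs, which is the undetectability property described above.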

A link is kept with the RPC even during the Thread_abort_safely call. Should the thread be scheduled for termination,
probably through a Thread_abort, the RPC is reachable and the reply port and send request will behave as described
above in section 4.1.9.1. The server resources are guaranteed not to become permanently pinned on a zombied reply [...]

There are several ways to view the real time issue with respect to Thread_abort_safely. It could be argued that the
asynchronous signaling process is a schedulable event in its own right. The exception message port would carry with
it a scheduling priority apart from the scheduling information associated with the thread targeted for the signal. In this
model no action need be taken regarding the ongoing RPC. If the signal is viewed as an act carried out by the target
thread itself, however, there is a need to adjust the scheduling information of the RPC, at least in the passive server
case. If the RPC has not had its request considered, the scheduling info can be altered to reflect suspend. This, of
course, may affect the order of queued requests processed. If the request is already underway and the client abort notify
port is active, the message service can send a message to the server stating that the request should be suspended. If
the client abort notify port is not active [...]
assumed that the first non-intervention approach will win the broadest appeal, as any other approach will affect completion
time, and this is in conflict with the attempt to make [...]

4.1.9.3 Thread_Abort_Notify:

As mentioned earlier, the main objective of Thread_abort_notify is to deliver [...]
of the message passing service upon return from a message passing call. In an effort to deliver the signal in a timely
fashion the call may be aborted, but only in a restartable fashion. Thread_abort_notify only delivers signals as part of
the return status from message passing requests. For this reason, if a Thread_abort_notify is sent to a thread either
not waiting on a message queue or at user level, the thread is put in notify signal state and [...]
reaches a state where notification can be delivered.

Because this notification method does indeed involve aborts, it is not possible to avoid completion time anomalies
on the RPC as in the Thread_abort_safely case above. This is a direct consequence of the need to expose the signal
to the synchronous execution stream. Because Abort_notify is [...]
it is not a mainstream function. An option to ignore thread_abort_notify has been included in the message header.

Thread_abort_notify, when experienced by a thread blocked on the request queue, results in the request being
removed from the queue and the client receiving a thread_abort_notify_send message. There are no artifacts, the server
was never aware of the request, and the client is free to retry the RPC or not based on its handling of the notify signal
directed at it.

When Thread_abort_notify aborts a wait on reply, the client thread is taken off of the reply wait queue and returned
with a status of thread_abort_notify_receive. The client is then free to process the notification and do a receive on the
reply port to continue the RPC. Thread_abort_notify will not abort a wait on reply unless the client has made its request
using an explicit reply port. This was done because returning the client while it was waiting on an anonymous reply port
would either have required the client to return through a special trap instead of doing a receive on the reply port, or a
receive right to the anonymous reply port would have to have been placed in the client's space.
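The outcomes described for Thread_abort_notify can be summarized as a small decision function. This is an illustrative sketch under stated assumptions: the statuses thread_abort_notify_send and thread_abort_notify_receive come from the text, while the Wait enumeration, the function name and the labels "ignored", "not_aborted" and "notify_signal_state" are hypothetical names chosen here for the remaining cases.

```python
from enum import Enum, auto


class Wait(Enum):
    NONE = auto()     # at user level, or not waiting on a message queue
    REQUEST = auto()  # blocked on the request queue
    REPLY = auto()    # blocked waiting for the reply


def thread_abort_notify(wait: Wait, explicit_reply_port: bool,
                        ignore_notify: bool = False) -> str:
    """Return the outcome of a notify abort for a client in the given wait state."""
    if ignore_notify:
        # The message header carries an option to ignore thread_abort_notify.
        return "ignored"
    if wait is Wait.REQUEST:
        # The request is removed from the queue; the server never saw it.
        return "thread_abort_notify_send"
    if wait is Wait.REPLY:
        if explicit_reply_port:
            # The client may handle the signal, then receive on the reply port.
            return "thread_abort_notify_receive"
        # A wait on an anonymous reply port is not aborted.
        return "not_aborted"
    # Not waiting: the thread is marked until notification can be delivered.
    return "notify_signal_state"
```

A client that receives thread_abort_notify_send may simply reissue the RPC; one that receives thread_abort_notify_receive must do its own receive on the explicit reply port to pick the RPC back up.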

If the client does indeed supply an explicit reply port, the system designer may feel compelled to take some action to
affect the scheduling of the server request handling of the associated RPC, especially in the case of the passive server
model. The case differs from Thread_abort_safely in that the synchronous code path is activated. The client is free to
go off and do anything, delaying its return to wait for the reply indefinitely. [...]
message (of notify [...]

Shared memory regions may be established through the message passing library 220 via two means: 1. the explicit
shared memory by-reference parameter with [...] overwrite on the server side to establish the shared region; or
2. the passing of a share capability. Though either method will establish a region, the first is considered more appropriate
for RPC and will be the only case described in detail in this section. Passage of a share capability is less constraining
to a server. The server is free to send the capability on to another task and, if the right was a send right instead of a
send once, the server may share the share region with others. This will create a common buffer among multiple parties.
The sender does this by doing a copy send of the capability in a message passed to a third party before consuming the
original right in a mapping call. Though the forwarding of data on message receipt will occur with the sending of a
message, broadcast is not defined for RPC and only the target server will receive an overt signal that the message
buffer has been filled or freed. Capabilities cannot be established on shared regions; we are, therefore, protected from
broadcast in the explicit shared memory setup case.






The vast majority of servers are stateless, that is, they do not hold information acquired specifically from the data
associated with a request beyond the delivery of the associated reply. (This, of course, does not include stochastic
information concerning frequency of usage and resource requirements.) Because of the preponderance of stateless
servers, it is expected that cases of shared memory usage in RPC will strongly favor client passing of control on the
request and recovery of control on the reply. The client may either pass information in on the request or recover it on
the reply or both. The diagrams of Fig. 30 outline the steps undertaken to establish shared memory and the protocol for
passing data back and forth.

The establishment of a shared memory region involves the message passing library 220 creating data structures
to hold information regarding the size and location of the share region in the local space and details on the whereabouts
of the region in the remote task. The information is attached to the task structure of both [...] One final warning:

Multiple mappings, mapping the region between more than two tasks, result in a linked list of share data [...]
which must be parsed on each share region pass. These should be avoided unless there is an express need for
broadcast.
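The bookkeeping described above can be illustrated with a small sketch. It assumes, purely for illustration, that each task keeps its share records keyed by the local address of the region and that additional mappings of the same region are chained through a next link; ShareRecord, Task, establish_share and mappings are hypothetical names, not the library's data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ShareRecord:
    local_addr: int      # location of the region in the local space
    size: int            # size of the region
    remote_task: str     # which task holds the other end
    remote_addr: int     # whereabouts of the region in the remote task
    next: Optional["ShareRecord"] = None  # further mappings of the same region


@dataclass
class Task:
    name: str
    shares: Dict[int, ShareRecord] = field(default_factory=dict)


def establish_share(a: Task, a_addr: int, b: Task, b_addr: int, size: int) -> None:
    """Attach a record of the share to the task structure of each party."""
    for local, l_addr, remote, r_addr in ((a, a_addr, b, b_addr), (b, b_addr, a, a_addr)):
        # A second mapping of the same local region is pushed onto a linked list.
        record = ShareRecord(l_addr, size, remote.name, r_addr, next=local.shares.get(l_addr))
        local.shares[l_addr] = record


def mappings(task: Task, addr: int) -> List[str]:
    """Walk the list that must be parsed on every pass of this share region."""
    names = []
    record = task.shares.get(addr)
    while record is not None:
        names.append(record.remote_task)
        record = record.next
    return names
```

With a single mapping the walk visits one record; every additional party sharing the same region lengthens the list walked on each pass, which is the cost the warning above refers to.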

It is important to note that a shared memory region cannot be created without the active support and knowledge of
both the sender and the receiver. Both the sender and the receiver will be aware of the size and location of the shared
region within their own space. In this way, if a server did not find that a particular client was trusted enough, it could
decline to accept a client directed share. The shared regions are also directional in the sense that one party establishes
them and then offers them up to be shared. This is important because it is often true in the active server case that a
server cannot trust the backing pager of a region of shared memory offered by a client. The pager might be
inactive and cause the server to hang on a memory fault. In this case, it must be the server who offers a region. The
region, of course, will be backed by a pager that the server trusts. Because of real time considerations, client pager
issues notwithstanding, it is [...] client directed share will be the method of choice in the passive server model.
The passive model asserts that the client thread of execution enters the server space. In this case, it is the server pager
that the client might not trust. Should the thread associated with the client request hang due to pager inactivity, the
thread body resources associated with the call could be reclaimed through the same mechanism used to prevent priority
inversion on blocking requests for other resources.

The message control structure allows the client to send the share buffer either statically, where its size is described in
the message control structure, or dynamically, where another parameter on the call serves to give the size. Overwrite
options give the receiver the ability to direct the overwrite region to a specified region of mapped memory or allow it to
be placed in formerly unmapped memory.

In Fig. 31, the message passing library 220 does not see an overwrite buffer on the server side; it checks to see if the
client region is actually shared by the server by checking task specific state info set up on share region initialization. If
the region is shared, it checks to see if the share is physical; if so, it just passes on the size and translated address; if
not, it does the explicit copy of data. If the region is not shared, it returns an error to the client.
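The check sequence described for Fig. 31 can be sketched as follows. This is a minimal illustration under assumptions: the per-task share state is modeled as a dictionary keyed by client address, memory is modeled as bytearrays, and ShareInfo, NotSharedError and pass_share_buffer are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ShareInfo:
    server_addr: int  # translated address of the region in the server
    size: int
    physical: bool    # True when both tasks map the same physical pages


class NotSharedError(Exception):
    """Raised when the client region was never set up as shared with the server."""


def pass_share_buffer(share_state: Dict[int, ShareInfo], client_addr: int,
                      client_mem: bytearray, server_mem: bytearray) -> Tuple[int, int]:
    info = share_state.get(client_addr)
    if info is None:
        # Not shared: an error goes back to the client.
        raise NotSharedError(hex(client_addr))
    if not info.physical:
        # Shared but not physically: do the explicit copy of the data.
        end = info.server_addr + info.size
        server_mem[info.server_addr:end] = client_mem[client_addr:client_addr + info.size]
    # In both shared cases the server gets the size and the translated address.
    return info.server_addr, info.size
```

In the physically shared case no bytes move at all; only the size and translated address are handed on.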

The message passing library 220 is very flexible with respect to the usage of a shared memory region. Either party
may send data in it. Either party may initiate the call, i.e., if a client existed in the server task space which called a server
in the client's space, the message passing library 220 would support it. Further, all or part of a shared region may be
involved in a transfer. The client may start a transfer on any address within a shared region and end on any address up
to the last one contained in the shared region. Messages may contain multiple areas from the same shared region in
separate parameters. Data from multiple shared regions may be passed in the same message.
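The bounds rule for partial transfers can be written down directly. The sketch below is illustrative only; SharedRegion and validate_message are hypothetical names, and a message is modeled simply as a list of (region, start address, length) parameters.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SharedRegion:
    start: int
    size: int

    def contains(self, addr: int, length: int) -> bool:
        # A transfer may begin anywhere in the region and must end no later
        # than the last address the region contains.
        return length > 0 and self.start <= addr and addr + length <= self.start + self.size


def validate_message(areas: List[Tuple[SharedRegion, int, int]]) -> bool:
    """Each parameter names a region plus the sub-range of it being passed."""
    return all(region.contains(addr, length) for region, addr, length in areas)
```

A single message may therefore carry several areas of one region alongside areas of another, as long as each area stays inside its own region.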

One-way message support in an RPC system is apt to strike one as an oxymoron. There are, however, conditions
which require support of one-way messages, especially on the [...] active server paradigm, the server
thread usually performs a reply with a subsequent receive for the next message. This is fine for the steady state,
but how do you start it? The answer is that you start the server with a one-way receive [...] the thread resource to
sleep awaiting the arrival of the first message.

Servers may find it convenient to process subsequent messages while a particular message is blocked, waiting for
a resource. Further, they wish to use the thread resource associated [...]
to do this in the explicit reply port case, they will have to do a receive without a reply to get the next message. When
the blocked message is finally reactivated, the server finds that it must now do a one-way send to institute the reply.
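The sequence of server-side operations implied here can be traced with a small simulation. This is a sketch under assumptions, not the library's interface: requests are modeled as (explicit reply port name, blocked) pairs, the function serve is hypothetical, and the blocked requests are assumed to be reactivated only after the inbox drains.

```python
from collections import deque
from typing import List, Tuple


def serve(requests: List[Tuple[str, bool]]) -> List[str]:
    """Return the message operations an active server thread would issue."""
    inbox = deque(requests)
    log = ["receive"]          # one-way receive: sleep until the first message arrives
    deferred = []
    while inbox:
        reply_port, blocked = inbox.popleft()
        if blocked:
            # The request waits on a resource: remember its explicit reply port
            # and fetch the next message with a receive that carries no reply.
            deferred.append(reply_port)
            log.append("receive")
        else:
            # Steady state: reply to this request and receive the next one.
            log.append("reply+receive:" + reply_port)
    for reply_port in deferred:
        # The blocked request is reactivated: a one-way send institutes its reply.
        log.append("send:" + reply_port)
    return log


operations = serve([("A", False), ("B", True), ("C", False)])
```

The trace shows all three one-way uses on the server side: the initial receive, the receive without reply, and the late one-way send to the explicit reply port.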

There are still RPC semantics associated with these one-way messages. Message control structure information, if
sent with the reply, must be in the RPC coalesced form. Both in the reply and the request case, the transfer is
fundamentally under the control of the client side message control structure as outlined in section 4.1.3.

One-way RPC send is not supported on the client side, and were it not for thread_abort_notify, it would not be
necessary to support client one-way RPC receive. When the client is aborted out of a wait on reply in the
thread_abort_notify case, a thread_abort_notify_receive signal is returned to the client






[...] re-establish the wait for reply. The requirement to explicitly re-establish a wait for reply will also be experienced if it
becomes necessary to support the old thread_abort_safely option to return thread_abort_send and thread_abort_rcv
as direct signals to the client. (This might be required for compatibility with old interfaces.) It would then be necessary
for the client code to explicitly re-establish the reply wait as in the thread_abort_notify case.

4.1.12 The Role of an Interface Generator: 

The RPC implementation of the message passing library 220 was meant to support RPC without the need of compiled
front ends. The interface presented by the message passing library 220 allows proxy routines to [...]
without expensive run-time interpretation. [...] is not to say that the message passing library 220 will [...]
ends, compiled or otherwise, but RPC can be supported with simple message control structure creation routines.

It is fully expected that the RPC interface will run with a variety of front ends. The MIG tool has been ported to the
message passing library 220 for backward compatibility. It has been extended to support the increased functionality.
Porting of the MIG front end has proven that the new RPC is a superset of existing functionality as exposed by the MIG
interface. Other front ends may be ported to provide compatibility with existing application code bases.

Because the message passing library 220 supports a full RPC interface, a front end need not be a context free
grammar. It may be just a finite state automaton. This greatly simplifies front end compiler design and code. Where another
model or an extended model is desired, however, there are several places where performance on some machines might
be improved through front end manipulation of the message [...]

The message passing library 220 will accept double indirection pointers, but under some circumstances (i.e., some
machines might be extremely slow at de-referencing pointers across a protection domain boundary), it might [...]
to de-reference the pointer in user space, possibly even twice, making it direct data. This would require the [...]
option prescribing the writing back of direct data to the client. Dynamic data fields could be turned into static [...]
This would require the message control structure be written with each pass, since it is here that static buffer lengths are
kept, but it would remove the need for a separate count parameter. Finally, all direct parameters could be coalesced
into a single direct data field, reducing the parsing time necessary in the message passing library 220.
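As an illustration of the last of these front end manipulations, the sketch below coalesces several direct parameters into one direct data field. It assumes, for the sake of the example only, that every direct parameter is a 32-bit little-endian signed integer; coalesce_direct and split_direct are hypothetical helper names that a generated proxy and server stub might use.

```python
import struct
from typing import List


def coalesce_direct(params: List[int]) -> bytes:
    """Pack all direct parameters into one field so the library parses a single entry."""
    return struct.pack("<" + "i" * len(params), *params)


def split_direct(blob: bytes, count: int) -> List[int]:
    """Server-side stub recovers the individual parameters from the single field."""
    return list(struct.unpack("<" + "i" * count, blob))


field_bytes = coalesce_direct([1, 2, 3])
recovered = split_direct(field_bytes, 3)
```

The trade-off is that the proxy and the server stub must agree on the packed layout, since the message control structure now describes one opaque field rather than each parameter.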

It is not expected that any of these options will [...] better performance in any but the rarest cases, but
they are mentioned in order to show the [...]

The RPC interface does not collect server interfaces together or create the proxy function tables and message
control structure tables. It is convenient to have a front end that can automate this. A front end may also prove [...]
place to do data transformations. And most importantly, front end support should be coordinated with co-residency
requirements, making the production of remote and local proxy libraries as automated as possible.

Although a specific embodiment of the invention has been disclosed, it will be [...]
the art that changes can be made to that specific embodiment without departing from the spirit and scope of the invention.

Claims

1. A method for interprocess communication in a microkernel architecture data processing system, comprising the
steps of:

loading a microkernel including a capability engine module into a memory of the data processing system;
forming with said microkernel a first task container in said memory having a set of attributes defining a first
communication port and a first set of port rights, and having a pointer to a memory object, said first set of port rights
conferring a capability on said first task container to access said memory object;
forming with said microkernel a second task container in said memory having a set of attributes defining a
second communication port and a second set of port rights;
registering in said capability engine said first set of port rights for said first task container and said second
set of port rights for said second task container;
comparing said first set of port rights and said second set of port rights in said capability engine;
and enabling a transfer with said capability engine of said pointer and said first port rights from said first task
container to said second task container to confer onto said second task container said capability to access said
memory object.

2. A method as claimed in claim 1, wherein: said data processing system comprises a shared memory multiprocessor
system including a first and a second processor;
said first task container has a thread that runs on said first processor;
said second task container has a thread that runs on said second processor and the enabling step enables






said first processor to communicate with said second processor.

3. A method as claimed in claim 2, wherein: the first and second processors each run a microkernel operating system;
the method comprising the steps of:

loading a first microkernel into a memory of a first processor in a distributed data processing system, said first
microkernel including a capability engine module;
forming with said first microkernel a first task container in said memory having a set of attributes defining a
first communication port and a first set of port rights, and having a pointer to a memory object, said first set of port
rights conferring a capability on said first task container to access said memory object, said first task container
having a thread that runs on said first processor;
forming with said first microkernel a second task container in said memory having a set of attributes defining
a second communication port and a second set of port rights, said second task container having a thread that runs
on an I/O communications processor, said I/O processor coupled by a communications link to a second data
processor running a second microkernel operating system;
enabling a transfer with said capability engine of said pointer and said first port rights from said first task
container to said second task container to confer onto said second task container said capability to access said
memory object, thereby enabling said first processor to communicate with said I/O communications processor;
transferring said pointer and said first port rights from said I/O communications processor to said second
processor to confer onto said second processor said capability to access said memory object.

4. A subsystem for interprocess communication in a microkernel architecture data processing system, comprising:

a microkernel in a memory of a data processing system, including a capability engine module;
a first task container in said memory having a set of attributes defining a first communication port and a first
set of port rights, and having a pointer to a memory object, said first set of port rights conferring a capability on said
first task container to access said memory object;
a second task container in said memory having a set of attributes defining a second communication port and
a second set of port rights;
said capability engine registering said first set of port rights for said first task container and said second set
of port rights for said second task container;
said capability engine comparing said first set of port rights and said second set of port rights; and
said capability engine enabling a transfer of said pointer and said first port rights from said first task container to said
second task container to confer onto said second task container said capability to access said memory object.

5. A subsystem as claimed in claim 4 wherein the data processing system comprises a shared memory multiprocessor
system running a microkernel operating system, the multiprocessor system having
a first processor and a second processor sharing a memory; said second task container has a thread that runs
on said second processor;
said capability engine enables said first processor to communicate with said second processor.

6. A subsystem as claimed in claim 5 wherein:

the first and second processors are coupled by a communications link and each runs a microkernel operating
system;
a first [...] of said first processor, [...] microkernel including a capability engine
module;
a first task container in said memory having a set of attributes defining a first communication port and a first
set of port rights, and having a pointer to a memory object, said first set of port rights conferring a capability on said
first task container to access said memory object, said first task container having a thread that runs on said first
processor;
an I/O processor sharing said memory with said first processor, and coupled by said communications link with
said second processor;
a second task container in said memory having a set of attributes defining a second communication port and
a second set of port rights, said second task container having a thread that runs on said I/O communications
processor;
said capability engine registering said first set of port rights for said first task container and said second set of port
rights for said second task container; said capability engine comparing said first set of port rights and said second
set of port rights; said capability engine enabling a transfer of said pointer and said first port rights from said first
task container to said second task container to confer onto said second task container said capability to access
said memory object, thereby enabling said first processor to communicate with said I/O communications processor;
said second processor running a second microkernel operating system; and said I/O communications processor






transferring said pointer and said first port rights to said second processor to confer onto said second processor 
said capability to access said memory object. 

7. A system for interprocess communication in a microkernel architecture, comprising:

a memory means in a data processing system, for storing data and programmed instructions;
an interprocess communications means in said memory means, for coordinating message passing between
tasks in said memory means;
a first task in said memory means having a set of attributes defining a first communication port and a first set
of port rights, and having a pointer to a memory object, said first set of port rights conferring a capability on said
first task to access said memory object;
a second task in said memory means having a set of attributes defining a second communication port and a
second set of port rights;
a processor means coupled to said memory means, for executing said programmed instructions;
a first thread in said memory means associated with said first task, for providing said programmed instructions
for execution in said processor means;
a capability engine means in said interprocess communications means, for registering said first set of port
rights for said first task and said second [...]
said thread providing a message from said first [...] interprocess com[...]
said pointer to said second task;
said capability engine means comparing said first set of port rights and said second set of port rights; and said
capability engine means enabling a transfer of said pointer and said first port rights from said first task to said second
task to confer onto said second task said capability to access said memory object.

8. The system of claim 7, which further comprises:

said first task represents an application program.

9. The system of claim 7, which further comprises:

said first task represents an operating system personality [...]

10. The system of claim 7, which further comprises:

said first task represents [...]

11. The system for interprocess communication in a microkernel architecture of claim 19, which further comprises:

a second processor means coupled to said memory means, for executing said programmed instructions;
a second thread in said memory means associated with said second task, for providing said programmed
instructions for execution in said second [...]

12. The system of claim 7, which further comprises:

said memory means and said processor means being in a first host system of a distributed processor system;
a communications link, for coupling said processor means in said first host system to a second host system
of said distributed processor system;
a second processor means in said second host system, coupled to said processor means in said first host
system over said communications link, for exchanging a reference to said pointer over said communications link.




FIG. 1

[Figure: host multiprocessor 100. Memory 102 contains dominant personality applications; alternate personality applications; a dominant personality (dominant personality server, other dominant personality services); an alternate personality (alternate personality server, other alternate personality services); boxes labeled multiple personality support, master server, initialization, naming, default pager, device support, device drivers, other PNS products, file server, network services, database engines and security; and microkernel 120 (IPC 122, virtual memory 124, tasks & threads 126, host and processor sets 128, I/O support & interrupts 130, machine dependent code 125). Also shown: auxiliary storage 106 and bus 104. Further reference numerals visible: 115, 140, 142, 144, 146, 148, 150, 152, 154, 155, 156, 158, 159, 180, 186.]



FIG. 2

[Figure: host multiprocessor 100 with memory 102, bus 104 and auxiliary storage 106. Legible labels: processor set name port, task, receive right. The remaining labels are illegible in the scan.]



FIG. 3

[Figure: host multiprocessor 100 with memory 102. Legible labels: port name space, task port, processor set, processor set name port, thread, host name port, send right, receive right, address space, and a task with scheduling parms, task self port, task exception ports, special ports, processor set, suspend count, statistics, contained threads, host self port, special attributes. Also shown: auxiliary storage 106, bus 104, I/O adapter 108, processor A 110, processor B 112.]



FIG. 4

[Figure: host multiprocessor 100 with memory 102. Legible labels: send rights, send right, receive right, and a port with right count, message queue, special attributes, and port/memory transfer limits; additional port [...] optionally sent as part of message. Also shown: auxiliary storage 106, bus 104, I/O adapter 108, processor A 110, processor B 112.]



FIG. 5

[Figure: host multiprocessor 100 with memory 102. A port name space, addressed by index, holds entries of kind port right or port set. Each port right entry shows a send right count (if send right) and a make-send count, sequence number and port set membership (if receive right), and refers to a port. Legend: send right, receive right. Also shown: auxiliary storage 106, bus 104, I/O adapter 108, processor A 110, processor B 112.]



FIG. 6

[Figure: host multiprocessor 100 with memory 102. A task's virtual address space, addressed by virtual address (index), holds memory ranges, each with inheritance, protection and system properties. Other legible labels: memory cache, memory manager task, memory cache object, abstract object name, known representatives (RW, R), abstract memory object port, memory cache name port, memory cache control port, memory object representative port. Legend: send right, receive right. Also shown: auxiliary storage 106, bus 104, I/O adapter 108, processor A 110, processor B 112.]



FIG. 7

[Figure: host multiprocessor 100 with memory 102. Task T(A) 210 (thread 248, port name, port rights, pointer to data object, task port, msg in 244); data object; task T(B) 210' (thread 248', port name, port rights, pointer register, task port, msg out 246); call scheduler 232; microkernel 120 containing IPC subsystem 122 with message passing library 220, capability engine 300, temporary data 400, control registration 500, asynchronous reply 600 and transmission control separation 700. Further reference numerals visible: 240, 242. Also shown: bus 104, processor A 110, I/O adapter 108, auxiliary storage 106, processor B 112.]



FIG. 9

[Figure labels: Port; Message; Task 1; Send Message to Port Owner; Capability Engine; VM Subsystem; Task 2.]



FIG. 10 



300 



\ 



Ihcoming Send Message 
Message Passin g libraiy 



Execution Path: The S^chro^^^ Gasc 
Example 1 E)o send, no receive 




THEGAPABILITY ENGINE 



No Waiting Servci^ - 





Block Sender 



CaU Port Specific 
SendQucucing 
Procedure 






FIG. 11 

300 






Incoming Send Message
Message Passing Library 



SVC 




Execution Path: The Synchronous Case 
Example 2: Send, receive waiting 

Transfer message, only call capability engine for
transferred ports or explicit capabilities



Found Waiting 
Server 




Do Thread 
Handoff



Do a thread handoff, call scheduler if appropriate, call
capability engine to put blocked client on queue if reply
explicit, else keep track of client at library level



The Capability Engine 



FIG. 12 



300 






Incoming Receive



Execution Path: The Synchronous Case
Example 3: Incoming Receive, no sends



Message Passing Library

no waiting sender

Block receiver 



Get Sender 





The Capability Engine 



The capability function called does not check the send
queue, it calls a port specific queueing function. It is up
to the Library to lock the associated port, to protect
against senders arriving after the check.
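
The note in FIG. 12 puts the locking burden on the library: the queueing function does not look at the send queue itself, so the check for a waiting sender and the enqueue of the receiver have to happen under one port lock. A minimal Python sketch of that discipline follows. The `Port` class, its two queues, and the function names are assumptions made for illustration, not names from the source.

```python
import threading
from collections import deque


class Port:
    """Hypothetical port: one lock guarding a send queue and a receive queue."""

    def __init__(self):
        self.lock = threading.Lock()
        self.send_queue = deque()     # senders blocked waiting for a receiver
        self.receive_queue = deque()  # receivers blocked waiting for a sender


def queue_receiver(port, receiver):
    """Stand-in for the port specific queueing function.

    It does NOT look at the send queue; the caller must hold port.lock.
    """
    port.receive_queue.append(receiver)


def library_receive(port, receiver):
    """Library-level receive: check for a sender and block under one lock."""
    with port.lock:
        if port.send_queue:
            # A sender is already waiting: hand it back instead of blocking.
            return port.send_queue.popleft()
        # No sender yet. The lock is still held, so no sender can arrive
        # between the check above and the enqueue below.
        queue_receiver(port, receiver)
        return None


def library_send(port, sender):
    """Library-level send: mirror image of library_receive."""
    with port.lock:
        if port.receive_queue:
            return port.receive_queue.popleft()
        port.send_queue.append(sender)
        return None
```

Releasing the lock between the check and the enqueue would reopen the window the note warns about: a sender could arrive after the check, find no receiver queued, and both sides would block.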






FIG. 13 



300 






Incoming Receive
Message Passing Library



Execution Path: The Synchronous Case 
Example 4: Incoming Receive, waiting send



Get Sender 




Return Sender from 
Port Specific queue



Transfer message, only call capability engine for ports
and explicit capability manipulation




The Capability Engine



Call Scheduler if appropriate, only call capability
engine if reply port is explicit



FIG. 14 



300 



Incoming Send 



Message Passing Library

No receiver

Get Receiver



Execution Path: The Asynchronous Case 
Example 1: Incoming send, no receive





Return 



Queue message






The Capability Engine 






FIG. 15



300 



Incoming Send Message
Message Passing Library 



Get Receiver




Execution Path: The Asynchronous Case
Example 2: Send, receive waiting 

Transfer message, only call capability engine for
transferred ports or explicit capabilities



Return receiver from
port specific queue 



The Capability Engine 



Set the priority info of receiver according to message.
Call the scheduler, return the send thread.






FIG. 16 



300 






Execution Path: The Asynchronous Case 

Example 3: Incoming Receive, no sends

Incoming Receive 

Message Passing Library

no waiting message

Block receiver

Get Message



The Capability Engine



The capability function called does not check the send
queue, it calls a port specific queueing function. It is up
to the Library to lock the associated port, to protect
against senders arriving after the check.



FIG. 17



300 



Incoming Receive 



Message Passing Library



Get Message 




Execution Path: The Asynchronous Case 
Example 4: Incoming Receive, waiting send 

Translate ports and capabilities where present in message,
map capabilities where by-reference is expected.

Return Msg from
Port Specific queue




The Capability Engine 



Alter receiver priority with message schedule information
(if asked for). Call the scheduler, return the receiver.






FIG. 18 



Send/RCV,
IPC/RPC,
Etc.



Optional
Transmission info



Message Header 



Message Data 




Direct Data

Ports 
Capabilities 
By reference



Message Parameter Information 



Individual 
Parameter 
information 



Optional 
Mapped 
Memory
Region 







FIG. 19 



Typical Call of a By Proxy User Level Library



Application believes it is 
calling XYZ locally. Instead 
a by proxy sets up an RPC to
a remote copy of I 



Proxy Function 



Create Message 




Message Header 
Most likely created 
on stack 



FIG. 20 



Message Control Structure 



Parameter Count:




Count of parameters in message, also count of primary descriptors

Primary descriptors, fixed size, one to one mapping with the
parameters in message, may point at secondary descriptor



Secondary Descriptors, no particular order 
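
FIG. 20 describes three parts: a parameter count, a fixed-size primary descriptor per parameter, and an unordered set of secondary descriptors that a primary descriptor may point at. The layout can be sketched in Python as below. The class names, the `kind` labels, and the use of a list index as the "pointer" are all assumptions made for illustration. The figure does not specify the contents of either descriptor type.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SecondaryDescriptor:
    """Extra information for one parameter (contents are assumed)."""
    info: bytes = b""


@dataclass
class PrimaryDescriptor:
    """Fixed-size entry; exactly one per message parameter."""
    kind: str                              # assumed label, e.g. "direct" or "by-reference"
    secondary_index: Optional[int] = None  # position in the secondary list, if any


@dataclass
class MessageControlStructure:
    primaries: List[PrimaryDescriptor] = field(default_factory=list)
    secondaries: List[SecondaryDescriptor] = field(default_factory=list)

    @property
    def parameter_count(self) -> int:
        # The count of parameters is also the count of primary descriptors.
        return len(self.primaries)

    def add_parameter(self, kind, secondary=None):
        """Append one parameter; store its secondary descriptor if given."""
        index = None
        if secondary is not None:
            self.secondaries.append(secondary)
            index = len(self.secondaries) - 1
        self.primaries.append(PrimaryDescriptor(kind, index))

    def secondary_for(self, param_index):
        """Follow a primary descriptor's pointer, if it has one."""
        idx = self.primaries[param_index].secondary_index
        return None if idx is None else self.secondaries[idx]
```

Because each primary descriptor carries its own index into the secondary list, the secondary descriptors can sit in any order, which matches the "no particular order" note in the figure.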






FIG. 21



Message Header Structure 



Fixed Portion



Optional Direct
Data



Temporary by-ref
Data Area







by-reference
optional transmission fields



Field

1   Message Buffer
2   Message Size
3   Header and Temporary Buffer Size
4   Message Control Structure Size
5   Message Control Structure Pointer
6   Optional Transmission Flags
    SEND/RCV port Disposition (dest and reply port disposition)
    Message Pointer
10  Message Type: Send/Rcv, Notify, etc.
11  Reply Port








FIG. 22 

Example 1: Trusted Client/Known Message ID 




Capability Engine



MCS Message Control Structure 



Adjusted Header
includes message 
ID 



Translated
Message

Message Sent back 
to server 



FIG. 23 




Example 2: Non-trusted Client/Known Message ID
& Unknown Client/Message Format









Message Passing Library



Server requests
Message Control Structure



Capability Engine 



MCS -> message control structure 



Adjusted Header
includes message



Client 



MCS 



Translated
Message

Message Sent back 
to server 






FIG. 24 



Message Format Registration



Server Registration Request 




Control Structure



Registration ID 




Server services client request
Server checks compatibility and returns
registration to client







Message Passing Library
Message Control Structure with port
level array of message control structures.



Port 



Message
Control
Structure



Index = Registration ID



New
Control
Structure




Message Passing Library passes the
Message to server




Message Passing Library passes the
Registration ID from server to client 
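
FIG. 24 shows the library keeping a port-level array of message control structures, with the registration ID serving as the index into that array. A minimal Python sketch of that bookkeeping follows. The `RegistrationPort` class and its method names are hypothetical, and the server's compatibility check on client requests is left out because the figure does not say how it is performed.

```python
class RegistrationPort:
    """Hypothetical port holding a port-level array of control structures."""

    def __init__(self):
        # The position in this list doubles as the registration ID.
        self.registered = []

    def register(self, control_structure):
        """Store a new control structure and return its registration ID."""
        self.registered.append(control_structure)
        return len(self.registered) - 1

    def lookup(self, registration_id):
        """Resolve a registration ID sent in place of a full control structure."""
        if not 0 <= registration_id < len(self.registered):
            raise KeyError(f"unknown registration ID {registration_id}")
        return self.registered[registration_id]
```

Once a format is registered, a sender can pass the small integer ID instead of the whole control structure, and the library resolves it with a single array access.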






FIG. 25 



Capability or permanent
by reference field 



Capability or permanent 
by reference field 

Capability or permanent
by reference field

Capability or permanent 
by reference field 

Capability or permanent
by reference field 




Outgoing Message



flai^ge^field moM;^ - 

Capability 
LOC X Permanent by-ref
LOC Y Permanent by-ref



Receiver Header 



1 Capability 

2 Gather 2 (LOC X)

3 Gather, UP TO 5
(LOC Y)



1 1 

11 






FIG. 26 



RPC Version of Message Passing



User Level Library 




Call specific Message Description
Message Header



Message

Application 



User Task 1 



IPC trap for Reply



User Task 2



Low Level IPC Code 



Copies in message data structure
(when necessary) and direct
message data. (Uses
?i)si<» for .tcmppiaiy stor- 




Main IPC Code 

Uses PORTS to determine the destination
TASK. Operates on THREADS or
THREAD BODIES doing scheduling
handoff or makes an explicit SCHEDULE CALL



Kernel 



IPC, write out

message, freeipNE storage 



Owner of Receive Right for Destination Port



Server 



Header 




Message 






Server Routine 






FIG. 27 

Queue Support Through the

Supplied Queueing Routine 



SVC call (with or without 
continuation routine)





Customizable
link 




Queue incoming thread
Possibly according to
scheduling priority



Check for server resource
NONE available

Consult port structure


Port




Send queue 
procedure


Capability


Call queueing procedure 




Engine 



Check Message
GaU Rincik^^E^ 

Setup Receive buffer
Header, overwrite

Do a send/rcv



Proxy function
Call Endpoint
function 



Primary Server Loop



Endpoint function behaves as if
local to caller






Message Passing Library Anonymous Reply Algorithm

is client explicit? 





is server explicit? 



NO 

Use the fully anonymous reply: block the client without
consulting the capability engine. Point at the client
from the server thread structure.








is server explicit? 



NO 

Put the client to sleep on the explicit reply port. Point
at the client from the server thread structure.



YES 



Place the explicit port's send or send_once right, according
to client reply disposition, in server space and send reply
right name in with the request.



Use an anonymous reply port. Place a send once right to the
port in server space. Send the explicit right name with the
request.
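
The flowchart above branches on two questions: whether the client's reply is explicit and whether the server's is. A compact Python sketch of one reading of that decision follows. The function name and the outcome labels are assumptions, and the pairing of each box with a branch is inferred from the order of the scanned boxes, so it should be checked against the original drawing.

```python
FULLY_ANONYMOUS = "fully anonymous reply"
SLEEP_ON_EXPLICIT_PORT = "client sleeps on explicit reply port"
EXPLICIT_RIGHT_TO_SERVER = "explicit port right placed in server space"
ANONYMOUS_PORT_TO_SERVER = "send-once right to anonymous port placed in server space"


def choose_reply_path(client_explicit: bool, server_explicit: bool) -> str:
    """Pick a reply-handling path from the two explicit/anonymous answers."""
    if not client_explicit and not server_explicit:
        # Neither side names a reply port: block the client without
        # consulting the capability engine; the server thread points at it.
        return FULLY_ANONYMOUS
    if client_explicit and not server_explicit:
        # The client supplied a reply port, the server does not need its name.
        return SLEEP_ON_EXPLICIT_PORT
    if client_explicit and server_explicit:
        # Hand the client's send or send_once right to the server.
        return EXPLICIT_RIGHT_TO_SERVER
    # Server wants an explicit right but the client gave none.
    return ANONYMOUS_PORT_TO_SERVER
```

Under this reading only the fully anonymous path avoids port rights entirely, which is why it can skip the capability engine.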






FIG. 30 



Share Region Initialization 



Client sends a by-reference 
region with share option on. 




SOURCE 



Descriptor by-reference
share



address of by-reference
region



Server Must accept share region ex- 
plicitly through an overwrite option 



TARGET 



Overwrite

fueCshare 




Header 



Overwrite
Buffer 



FIG. 31



region share option on.



Header 



Message
Control



By ref- 
erence 
Region 



Buffer



SOURCE 



Share Region Usage, RPC Common Case



Descriptor by-reference
share

address of by-reference
region



Server sends normal 
receive header 



Header 



Server receives indication 
data written into pre-es- 
tablished share region. 



